Improving redundancy in your network - gateway redundancy

Most business networks rely on their internet connection, and they rely on it heavily. Not only may you lose business by being offline (and thus unreachable by your customers), productivity also drops to zero almost immediately. For most businesses, a working internet connection is therefore crucial.

Easier said than done. No internet provider has 100% uptime, and even if one did offer such a service, you would pay a steep price for it. Compare home internet access with business internet access and the cost of the latter is significantly higher. There is an upside though: reliability. The SLA (Service Level Agreement) specifies how fast the provider needs to fix your broadband access, should it fail because of them. Obviously they won't fix it if the problem lies outside their responsibility. A typical SLA for home users is around three working days, but for business customers it's more like NBD (next business day), or even better: 4 hours. That's quite impressive actually.

This article is the first in a series looking at how you can improve the critical parts of your network to minimize downtime. This is not only about working around ISP problems. Wouldn't it be great if your systems administrator didn't have to work overnight, so you could save on costs? With a fully resilient network this is entirely possible, so having one is actually cheaper than you think. In this article, we'll focus on external connectivity only: the connection to the internet. Let's face it: when your connection goes down, one of two things can happen. It gets fixed within the SLA, or it doesn't. Either way, you'll be offline for some time. At times like this you wish you had a second connection to fail over to. So... why don't you get one?

In its most basic form, even an existing router that cannot handle a second broadband connection can still serve you well. When the usual connection is broken, just swap the cables and keep surfing. While this works, it's far from ideal, for a number of reasons:

  • you are paying for two broadband connections, but at no time using both
  • should the second internet connection be broken as well, you would never realize this until you actually need to use it
  • failing over from one broadband connection to another requires on-site presence and human interaction, which takes time
  • troubleshooting can take longer, in case you fail over and you still don't have connectivity
  • manual configuration of your router is required, or you need to have two routers

    Why don't we dig deeper and see how you can be redundant in a cost-effective way? Redundancy can be achieved in a number of ways; the only question is money. You can have two business internet connections in the same router. Or in two separate routers, giving an additional layer of safety (should one router go nuts, you still have another). You can have one business internet connection and one home connection. Or you can have two home connections - after all, the whole point is not being bothered should one go down. Things are easy if you only access services on the internet, and can get a bit complicated when you also provide services over the internet. Your IP address will change, after all. Or maybe not: when using BGP, you can still have the same IP range reachable over your secondary connection. Technically there is a solution to everything, and as I said: money is the only limitation.

    Arriving at the key point (money), let's pick a suitable router. Your options are:

  • any Linux-based PC
  • a SOHO router designed for multiple connections
  • high quality branded routers, such as ones from Cisco

    Starting with the number of routers needed, you can have one router responsible for both ISP connections, or you can have two, each responsible for one. I'd recommend the latter. This is because there is little use in having two connections in the same device: if the device breaks, both connections go down. This is what is called a single point of failure. You don't want single points of failure; the whole purpose of this article is to eliminate as many as possible.

    Obviously, when you have two devices, new problems come to life. How will they know which connection to use? How will the rest of the network know which device to use? What if the devices change roles, how will the rest of the network be notified? These are all valid questions, and there is a solution for all of them. Let's quickly eliminate SOHO routers; I'm not a fan. These devices can be tricky. In most cases, they do not have the ability to communicate with another router, so having two and switching over automatically won't be possible. They are designed to be on their own and handle two connections. Sometimes there is a limitation: the secondary connection is only used when the primary fails. So you do end up paying for two but using only one at any given time. And also, should the secondary have a problem, you won't even know until it's too late. For a serious business like yours, these are not great devices to go with.

    A Linux-based PC is an interesting choice. If you have the knowledge to script everything, that's great. You can also choose to go with a distribution that is created for the purpose. If you do so, you may see the same limitations as with the SOHO routers. Scripting is probably what you'll end up with, which requires above-average knowledge of Linux in general, scripting, networking and protocols like VRRP. You may end up paying more for your sysadmin to do this than you save by using a simple PC. And let's not forget: you need two PCs for the task, to avoid a single point of failure.

    As you may have guessed, for the rest of the article we're exploring your options when using Cisco routers. Cisco routers can be expensive, but you can also get them used from eBay. It is not difficult to get a good deal, so you might as well buy two. They don't have to be identical; you can have two totally different types. The only important thing is to stay within the family: combine routers with routers or firewalls with firewalls. No mixing and matching, as they speak different protocols and achieve redundancy in different ways.

    If you have access to Feature Navigator, look for HSRP as the supported protocol; this will be a key element. Most routers do support this ancient feature. HSRP is basically Cisco's proprietary counterpart to VRRP; they do the same thing. The abbreviation stands for Hot Standby Router Protocol. It is designed for two or more routers being in the same network, all acting as gateways. HSRP is responsible for picking the router that will be the active one; the rest of them will remain as standbys. It is also responsible for announcing this to the network, so everybody will end up using the correct gateway to the internet. Does this mean the rest of your network needs to understand HSRP? No, it does not.

    HSRP works only on Cisco routers (VRRP, an open standard that does the same thing, works on equipment from many vendors). The active and standby routers constantly exchange hello messages. Should the active fail to respond in a timely manner, one of the standbys assumes it has failed and takes over. Timers are configurable and you can even go sub-second, meaning a total failover will be transparent to the rest of the network. So in layman's terms: you have two routers, A and B. A being primary, B being secondary. What B does is constantly nag router A. Are you ok? Router A responds with: yes, I'm fine. And this happens over and over again. Are you ok? Yes. And now? Yes. And now? Yes. As soon as router A fails to respond within the configured timeframe, router B goes: oh, router A is not responding! It must have failed! Quick, let's take over the primary role. And so router B does, immediately. Since both router A and B have their own internet connection, as soon as the packets from the inside network make it to either of the routers, they will be passed onward to the internet. At no time are there two primary routers, only a single one, but there can be many secondary routers. Now let's see how they announce themselves to the inside network and how they force traffic to end up at the primary router.

    Let's see what is happening from an inside workstation's point of view. Workstation X wants to send something to the internet. So it figures out the IP address of the destination. Based on its own IP address and netmask, it quickly realizes that the destination is not on its own subnet, so the packets need to go to the gateway. So it looks up its configured gateway IP address (this can be set manually or sent from a DHCP server) and sends the packet there. For the sake of simplicity, let's assume that we have the internal network of 10.0.0.0/24, router A being 10.0.0.2, router B being 10.0.0.3 and the workstation being 10.0.0.4. First question: if the workstation 10.0.0.4 has the gateway 10.0.0.2 configured (router A), how will it send packets to router B once that has taken over the primary role? Not knowing that router A has failed, the workstation will keep sending packets to it, and it does not know anything about router B.
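
    The workstation's decision can be sketched in a few lines of Python with the standard ipaddress module. This is only an illustration of the on-subnet test; the function name next_hop is made up, and 8.8.8.8 simply stands in for "some address on the internet".

```python
import ipaddress


def next_hop(src_ip: str, prefix_len: int, gateway: str, dst_ip: str) -> str:
    """Return the IP the workstation hands the packet to (hypothetical helper).

    If the destination is inside the workstation's own subnet, the packet goes
    straight to the destination; otherwise it goes to the configured gateway.
    """
    network = ipaddress.ip_interface(f"{src_ip}/{prefix_len}").network
    if ipaddress.ip_address(dst_ip) in network:
        return dst_ip
    return gateway


if __name__ == "__main__":
    # Workstation 10.0.0.4/24 with router A (10.0.0.2) configured as gateway.
    print(next_hop("10.0.0.4", 24, "10.0.0.2", "10.0.0.3"))  # same subnet
    print(next_hop("10.0.0.4", 24, "10.0.0.2", "8.8.8.8"))   # via the gateway
```

    Note that the gateway is just a fixed string here: nothing in the workstation's logic reacts to router A failing, which is exactly the problem described above.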

    The solution to this problem is called a virtual IP address, a key point of HSRP. Router A and router B have a total of three addresses altogether: 10.0.0.2 belonging to router A, 10.0.0.3 belonging to router B, and a third one that is shared by them. This can be 10.0.0.1. This third IP address only exists on the primary router, and when that fails, the secondary takes over not only the primary role but the third IP address as well. In other words: the third IP address is always handled by the primary, active router. The workstations therefore need to be configured with the gateway address of 10.0.0.1. This is why the workstation does not have to know which router is active: it will always use the virtual address, and it is the routers' responsibility to handle it.

    There is one single question remaining. As we all know, computers on the same subnet use layer 2 addressing: MAC addresses, not IP addresses. When the virtual IP is claimed by router B, router B has a different MAC address. Meanwhile, the workstation has router A's MAC address cached along with the virtual IP address 10.0.0.1. Until this MAC cache times out, even though the IP address is now handled by router B, no traffic will actually reach router B, because everything is still addressed to router A's MAC address. Correct? No.

    It's not only a virtual IP address that is shared between router A and B. There is also a MAC address, for this very reason. The MAC address that pairs with the virtual IP address is also virtual and is also shared between the routers. Just as with the IP address, the third MAC address is also always claimed by the single primary router. So the workstations won't even know it's a different router handling the traffic. They can keep sending traffic to the same IP/MAC pair and it will always reach whichever router is active/primary.
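
    To make the shared IP/MAC pair concrete, here is a small Python sketch. It assumes HSRP version 1, where the virtual MAC address is derived from the HSRP group number as 00:00:0c:07:ac:XX (XX being the group in hex); the group number 1 and the Lan class are purely illustrative.

```python
def hsrp_v1_virtual_mac(group: int) -> str:
    """Virtual MAC for an HSRP version 1 group (groups 0-255)."""
    if not 0 <= group <= 255:
        raise ValueError("HSRP version 1 group numbers range from 0 to 255")
    return f"00:00:0c:07:ac:{group:02x}"


class Lan:
    """Toy LAN: frames sent to the virtual MAC land on whichever router is active."""

    def __init__(self, virtual_mac: str) -> None:
        self.virtual_mac = virtual_mac
        self.active_router = "A"

    def fail_over(self) -> None:
        """Router B claims the virtual IP and the virtual MAC."""
        self.active_router = "B"

    def deliver(self, dst_mac: str):
        """Return the router that receives a frame, or None if nobody owns the MAC."""
        return self.active_router if dst_mac == self.virtual_mac else None


if __name__ == "__main__":
    vmac = hsrp_v1_virtual_mac(1)          # group 1 is an assumption
    arp_cache = {"10.0.0.1": vmac}         # what the workstation has cached
    lan = Lan(vmac)
    print(lan.deliver(arp_cache["10.0.0.1"]))  # frame reaches router A
    lan.fail_over()
    print(lan.deliver(arp_cache["10.0.0.1"]))  # same cache entry, now router B
```

    The workstation's cache entry never changes in this model; only the owner of the virtual MAC does.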

    Right, now we have covered all the bases: should a router fail, it will not cause a problem in your network, and you will still have connectivity. You can even pull the power cord out of the active router; nobody will notice, and traffic will keep on flowing. But is this the only source of problems? Obviously not. What if the router never fails, but its connection to the internet does? Router A in this example will still happily answer all questions from router B and will blackhole traffic, as there will be no working internet connection. Yet router A will remain primary and nothing will work. Correct? No.

    Let me introduce two things: router priority and IP SLA. HSRP knows about a value called the router's priority, defaulting to 100. The higher the number, the higher the priority; this is the value that decides which router becomes primary. You need to set this up manually, and during the 'hello' conversation routers will also tell the other HSRP routers their priority value. The router with the highest number becomes active, as long as nobody else announces a higher number. Simple. So for starters, you can configure router A with a priority value of 120 and router B with a priority value of 110. This will make router A primary, and router B will know this as well.
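
    The election itself is simple enough to sketch in Python. This is an illustration rather than the protocol's actual state machine: the function name is made up, and the tie-break on the highest interface IP address is stated here as an assumption about how HSRP resolves equal priorities.

```python
import ipaddress
from typing import Dict, Tuple


def elect_active(routers: Dict[str, Tuple[int, str]]) -> str:
    """Pick the active router from {name: (priority, interface_ip)}.

    Highest priority wins; on equal priority the higher IP address wins
    (assumed tie-break, see the note above).
    """
    return max(
        routers,
        key=lambda name: (routers[name][0], ipaddress.ip_address(routers[name][1])),
    )


if __name__ == "__main__":
    # Router A with priority 120, router B with priority 110.
    print(elect_active({"A": (120, "10.0.0.2"), "B": (110, "10.0.0.3")}))
```

    Leaving both routers at the default priority of 100 would make the outcome depend on the tie-break alone, which is why setting the priorities explicitly is the safer habit.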

    Priorities do not remain static; they can be adjusted automatically. It is extremely easy to set up a router so that, should the WAN connection go down, it automatically decrements its priority by a certain value. Say 15, for example. Given that router A in this example has a priority value of 120 and router B has 110, when router A's interface towards the internet goes down, it will reduce its priority value from 120 to 105 (120-15). This new value is no longer the highest, therefore router B will take over without router A itself failing (provided preemption is enabled on router B, as a standby router does not preempt a still-active one by default). As you can see, losing an interface is enough to switch to the secondary.
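
    The arithmetic can be checked with a short Python sketch. It uses the numbers from the example (120, 110 and a decrement of 15), assumes preemption is enabled on both routers so the higher priority always ends up active, and uses made-up function names.

```python
def effective_priority(base: int, wan_up: bool, decrement: int = 15) -> int:
    """Priority a router advertises: the base value, reduced while its WAN link is down."""
    return base if wan_up else max(base - decrement, 0)


def active_router(a_wan_up: bool, b_wan_up: bool = True) -> str:
    """Which router is active, assuming preemption is enabled on both.

    Router A has base priority 120, router B has 110. With a decrement of 15
    the two values can never be equal, so no tie-break is needed here.
    """
    prio_a = effective_priority(120, a_wan_up)
    prio_b = effective_priority(110, b_wan_up)
    return "A" if prio_a > prio_b else "B"


if __name__ == "__main__":
    print(effective_priority(120, wan_up=False))  # 120 - 15
    print(active_router(a_wan_up=True))
    print(active_router(a_wan_up=False))
```

    The decrement has to be larger than the gap between the two priorities (here 10), otherwise router A would stay active even with its WAN link down.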

    I know your next question: what if the interface towards the ISP doesn't go down, but the internet connection still doesn't work? This is a pretty common problem: an ISP fault 'somewhere up' the chain. No interface goes down, as the problem is further up somewhere in the cloud, and router A does not fail either. How will router B realize it needs to take over? The answer is IP SLA. Cisco routers support this; it is a configuration where the router performs a certain procedure to verify something, such as pinging an address on the internet, 8.8.8.8 for example. And as you could have guessed by now: yes, the router is able to reduce its priority value if and when there is no response to these ping probes. Clever, huh? So router A can do more than constantly answer router B's 'are you ok?' question. It can simultaneously ping out to the internet and ask itself 'am I ok?' by expecting pings back. Should the pings not come back in a timely manner, router A can decide it is no longer fit for the gateway role and decrement its own priority by a configured value - say 15. It will then no longer hold the highest value and router B will automatically take over, provided its own connection is working fine of course.
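
    As a rough Python model of the 'am I ok?' loop: the probe is passed in as a function so the sketch does not depend on a real ping, and the names run_probes and probe are hypothetical. It only mirrors the idea of lowering the priority while the probe fails, not the actual IP SLA feature or its configuration.

```python
from typing import Callable, List


def run_probes(
    probe: Callable[[], bool],
    count: int,
    base: int = 120,
    decrement: int = 15,
) -> List[int]:
    """Run `count` reachability probes and return the priority advertised after each.

    While the probe fails the priority is base - decrement; once it succeeds
    again the router goes back to advertising its base priority.
    """
    priorities = []
    for _ in range(count):
        reachable = probe()
        priorities.append(base if reachable else base - decrement)
    return priorities


if __name__ == "__main__":
    # Simulated probe results: the upstream problem appears on the third probe.
    results = iter([True, True, False, False, True])
    print(run_probes(lambda: next(results), 5))
```

    In a real deployment the probe target should be something outside the ISP's own network, otherwise an upstream fault would go unnoticed.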

    All of the above are brilliant methods to keep an active connection going, and because of the constant pinging towards the internet, you will immediately be notified should either connection have a problem. Should the secondary connection fail, you can work on fixing it even before it's due to go live.