Clustering and High Availability for PeopleSoft 8.4
This post discusses the components used to provide scalability and high availability for internet services. Rather than covering every possible configuration and device, the discussion is limited to systems that apply to the PeopleSoft architecture and have been tested in the field. The complexity and cost of a system depend largely on its required Quality of Service (QoS), which specifies the level of scalability and fault tolerance the system must provide. At one extreme is a single server with no guaranteed uptime; at the other is a system that provides 24x7 service with better than 99.999% availability, i.e. telecommunication-grade service. Most of our customers will choose a level of service somewhere in between, based on their budget.
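To make an availability target concrete, it helps to convert the percentage into a yearly downtime budget. The following is a minimal sketch; the function name and the list of service levels are illustrative and not part of any PeopleSoft tooling.

```python
# Convert an availability percentage into the downtime it allows per year.
# Assumption: a year of 365.25 days; names here are illustrative only.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by the given availability."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for level in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999% the budget is a little over five minutes per year, which is why that level usually requires redundancy at every tier.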
Manufacturers of network devices publish MTBF (Mean Time Between Failures) numbers, which should be considered carefully. A higher number is better but costs more. Do not judge solely on MTBF without also considering MTTR (Mean Time To Repair), because units that are difficult to repair will eventually contribute to higher downtime. MTTR is difficult to estimate because it depends on factors such as the time to diagnose a problem, the availability of parts, and the engineer’s knowledge of the affected unit. Calculate the availability of the overall infrastructure as follows:
Availability of a component x: A_x = MTBF / (MTBF + MTTR)
Availability of a redundant group of components x and y: A_{x+y} = 1 - ((1 - A_x) * (1 - A_y))
Availability of two redundant groups in series, forming a complete system: A_overall = A_{x+y} * A_{p+q}
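The three formulas above can be sketched in a few lines of Python. The MTBF and MTTR figures below are hypothetical values chosen only for illustration, not vendor data, and the helper functions accept two or more components as a straightforward generalization of the two-component formulas.

```python
# A worked sketch of the availability formulas.
# All MTBF/MTTR numbers are hypothetical, for illustration only.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A_x = MTBF / (MTBF + MTTR) for a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_group(*availabilities: float) -> float:
    """Redundant group: it fails only if every member fails at once."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


def series(*availabilities: float) -> float:
    """Groups in series: the system is up only if every group is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical pair of routers (x, y) and pair of switches (p, q).
a_router = availability(mtbf_hours=10_000, mttr_hours=8)
a_switch = availability(mtbf_hours=50_000, mttr_hours=4)

a_routers = redundant_group(a_router, a_router)   # A_{x+y}
a_switches = redundant_group(a_switch, a_switch)  # A_{p+q}
a_overall = series(a_routers, a_switches)         # A_overall

print(f"single router:  {a_router:.6f}")
print(f"router pair:    {a_routers:.8f}")
print(f"overall system: {a_overall:.8f}")
```

Note how the MTTR enters every term: halving the repair time of a unit improves its availability as surely as a better MTBF does.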
The various components to consider in the system are:
Internet Connectivity – For high availability, internet connectivity should be obtained from multiple (at least two) internet service providers. If one provider fails, users can still reach the system through the other. The key feature to look for is diversity in connectivity between the two providers, e.g. consider installing a leased line for the primary provider and satellite or cable modem for the backup. Smaller sites could set up dial backup on the backup router for a more cost-effective solution. With cooperation from both providers, it is possible to run full BGP-4 (Border Gateway Protocol) routing for advanced failure detection and failover.
Routers – The router needs to be fault tolerant. At a minimum the network architecture should be dual redundant. The routers can be configured to run in primary/backup mode using either Virtual Router Redundancy Protocol (VRRP) or, for Cisco routers, HSRP (Hot Standby Router Protocol). Under these protocols each unit in a pair sends out packets to see if the other will respond. If the primary fails, the backup takes over its functions. Most routers also have some firewall capability, e.g. packet filtering and port blocking. These features should be enabled for added security whenever possible.
Customers using colocation will generally not have access to the router because it is part of the colocation provider’s equipment. In these cases all security features must be implemented within the system using additional equipment (firewall, loadbalancer NAT, reverse proxy server etc.).
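The primary/backup behaviour of a redundant router pair can be illustrated with a toy model. This is a simplified sketch of the idea only: it does not implement the VRRP or HSRP wire protocols, and the class, function and router names are hypothetical.

```python
# Toy model of primary/backup failover in a router pair.
# Not the real VRRP/HSRP protocol; all names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    priority: int        # higher priority wins, loosely mirroring VRRP/HSRP
    alive: bool = True

    def answers_heartbeat(self) -> bool:
        """Stand-in for replying to the peer's periodic packets."""
        return self.alive


def elect_active(pair: List[Router]) -> Optional[Router]:
    """Return the highest-priority router that still responds, if any."""
    responders = [r for r in pair if r.answers_heartbeat()]
    if not responders:
        return None
    return max(responders, key=lambda r: r.priority)


primary = Router("router-a", priority=200)
backup = Router("router-b", priority=100)
pair = [primary, backup]

print("active:", elect_active(pair).name)   # primary is healthy

primary.alive = False                       # simulate a primary failure
print("active:", elect_active(pair).name)   # backup takes over
```

In a real deployment the election runs continuously on the routers themselves, and the pair shares a virtual address so that hosts do not need to change their default gateway after a failover.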
Switches/VLANs – Switches interconnect all the network devices in a system. To build a redundant system, at least two physical switches should be used. The discussion that follows uses layer 2 switches. Failover for these devices can be configured by using the spanning tree protocol and connecting the devices with a trunk link.
The trunk must use redundant interconnects to prevent the LAN from splitting in two. In the configurations shown in this document we have avoided cross-connecting switches with routers and hosts. This is a simple configuration that all routers and hosts will support, but if one of the switches fails, half of the servers in the network (all servers connected to the affected unit) are taken offline.
Firewalls – The firewall is possibly the most difficult device to incorporate into a system that is being designed for high availability. In most systems it soon becomes the bottleneck if the design is not done properly. It is not uncommon for extremely high-throughput systems to avoid a firewall at the internet entry point. Instead, a combination of routers, loadbalancers and reverse proxy servers is used to achieve the necessary level of security for the first tier of the system. High availability with firewalls can be tricky too. Most vendors provide some clustering capability that allows either an array of identical servers dividing the load among themselves or an active/active pair of units.
In the following sections we use a 3-pronged firewall. This firewall has 3 interfaces: one for the Internet, one for the Intranet and one for the DMZ services. The limitation of this configuration is that the Intranet site has a single point of protection (and thus a single point of security failure). If this is not acceptable, the 3-pronged firewall should be preceded by another pair of redundant firewalls. It is possible to use firewall loadbalancers (FWLBs) to distribute load among identical firewall units for greater scalability, but the configuration is not simple. Implementing the 3-pronged firewall with redundancy takes 6 extra loadbalancers and 6 extra switches/VLANs.
Loadbalancers – A highly recommended device for achieving high scalability and fault tolerance at a reasonable cost. The current street price for these units ranges from $5,000 to $50,000. Some units starting at $12,000 can be configured to replace a firewall and provide a hardware SSL accelerator, which provides security and scalability at a reasonable cost. Again, a pair should be deployed for redundancy. On most loadbalancers each physical unit can be configured into multiple logical units. Network security and architecture permitting, the logical units can be used to loadbalance multiple applications.
Reverse Proxy Servers – Reverse Proxy Servers (RPS) are generally used as part of the security infrastructure. Most sites deploy them when there is a security concern about IP packets from untrusted users reaching the production webservers. An RPS provides protection from attacks that exploit vulnerabilities such as buffer overflows and malformed packets. It also adds another tier to the security architecture. Other sites may use an RPS as a single sign-on portal server, which allows users authenticated by the RPS to access multiple internal systems with varying authentication schemes without authenticating to each system individually.
RPS is almost always loadbalanced using a loadbalancer. For PeopleSoft applications, a site’s domain name will map to the loadbalancer for the RPSs. In this document the example site portal.corp.com should be mapped to the VIP 123.123.123.100 by external DNS systems, and this VIP should be mapped to the RPS loadbalancer.
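The chain from site name to VIP to a reverse proxy server can be sketched as two lookups. The site name and VIP come from the example in this document; the RPS addresses and the round-robin policy are assumptions made for illustration, since real loadbalancers offer several distribution policies plus health checks.

```python
# Sketch of name -> VIP -> reverse proxy server resolution.
# portal.corp.com and 123.123.123.100 are this document's example values;
# the RPS addresses and round-robin policy are hypothetical.
from itertools import cycle
from typing import Tuple

EXTERNAL_DNS = {"portal.corp.com": "123.123.123.100"}

# Hypothetical pool of reverse proxy servers behind the VIP.
VIP_POOLS = {"123.123.123.100": ["10.0.1.11", "10.0.1.12"]}

# One endless rotation per VIP, consumed one backend per request.
_rotations = {vip: cycle(pool) for vip, pool in VIP_POOLS.items()}


def route(hostname: str) -> Tuple[str, str]:
    """Return (vip, rps_address) chosen for one request to hostname."""
    vip = EXTERNAL_DNS[hostname]       # what external DNS hands to browsers
    backend = next(_rotations[vip])    # what the loadbalancer picks
    return vip, backend


for _ in range(4):
    print(route("portal.corp.com"))
```

Browsers only ever see the VIP, so RPS hosts can be added to or removed from the pool without any DNS change.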
Servers – Servers themselves have a number of fault-tolerant mechanisms built into them, e.g. redundant network cards, RAID arrays, dual power supplies, fault-tolerant motherboards etc. At a minimum there should be two servers configured as a dual redundant system. Other than the vendor-recommended database clustering, PeopleSoft applications do not use any OS-provided server-clustering mechanism. This gives our customers greater flexibility to pick best-of-breed HW/SW solutions.
DNS Servers – A PeopleSoft production system should avoid using DNS name resolution whenever possible. It may be necessary, however, for PeopleSoft Portal or Application Messaging to access remote servers. Only if this is a requirement, and adding /etc/hosts entries for those names is not convenient, should DNS name resolution from a local server be considered. Under no circumstances should the local DNS servers be allowed to receive DNS updates from remote servers. The local DNS server should also be prevented from sending DNS queries to the remote server for local addresses. In other words, the local DNS server should only query the remote server for addresses outside the site’s local domain. High availability is maintained by running a primary and a backup DNS host, connected to two separate switches. All hosts that need access to DNS service should be configured to use both the primary and the backup DNS host.
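The resolution policy described above can be expressed as a small decision function. This is only a sketch of the policy, not a DNS server configuration: the local domain corp.com is inferred from this document's example site, and the host names and addresses in the hosts table are hypothetical.

```python
# Sketch of the name-resolution policy: static hosts entries first, local DNS
# for local names, and the remote server only for names outside the local domain.
# LOCAL_DOMAIN and the hosts table below are illustrative assumptions.
from typing import Dict

LOCAL_DOMAIN = "corp.com"


def is_local(name: str, local_domain: str = LOCAL_DOMAIN) -> bool:
    """True if name is the local domain itself or a host inside it."""
    name = name.rstrip(".").lower()
    local = local_domain.rstrip(".").lower()
    return name == local or name.endswith("." + local)


def resolution_path(name: str, hosts_table: Dict[str, str]) -> str:
    """Decide where a lookup for name may be answered."""
    if name in hosts_table:
        return "hosts-file"          # preferred: no DNS involved at all
    if is_local(name):
        return "local-dns"           # never forwarded to a remote server
    return "forward-to-remote"       # only non-local names leave the site


hosts = {"psapp1.corp.com": "10.0.2.21"}  # hypothetical /etc/hosts content

for query in ("psapp1.corp.com", "psweb2.corp.com", "partner.example.org"):
    print(query, "->", resolution_path(query, hosts))
```

The suffix check deliberately requires a leading dot, so a name such as evilcorp.com is not mistaken for a host inside corp.com.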
Storage – All PeopleSoft data (configuration metadata) and user data is stored in databases. The databases should be stored on a fault-tolerant device, e.g. a RAID (Redundant Array of Inexpensive Disks) device. At a minimum the storage subsystem should use data striping, e.g. RAID 5 for low-cost systems and RAID 10 (1+0) or RAID 0+1 for high-performance systems.
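The cost difference between the two RAID levels shows up in usable capacity. The sketch below assumes identical disks, single parity for RAID 5 and two-way mirroring for RAID 10; the function name and the disk counts are illustrative.

```python
# Usable capacity for the two RAID levels mentioned above.
# Assumptions: identical disks, single-parity RAID 5, two-way mirrored RAID 10.

def usable_capacity_gb(level: str, disks: int, disk_size_gb: float) -> float:
    """Return usable capacity in GB for a RAID 5 or RAID 10 array."""
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_size_gb      # one disk's worth holds parity
    if level == "raid10":
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return (disks // 2) * disk_size_gb     # half the disks hold mirror copies
    raise ValueError(f"unsupported RAID level: {level}")


for level in ("raid5", "raid10"):
    print(level, usable_capacity_gb(level, disks=8, disk_size_gb=100), "GB usable")
```

With the same eight disks, RAID 5 yields more usable space, while RAID 10 gives up capacity in exchange for avoiding parity calculations on writes.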
Power Supply – A minimum of two UPS (Uninterruptible Power Supply) units is recommended. For systems with higher availability requirements, the UPS units should be backed by power generators and by power drops from two separate substations.
Disaster Recovery Plans – Finally, all installations, small or large, must create a disaster recovery plan. For large installations this should include the creation of a second data center at a distant geographic location. The current version of this document does not address all aspects of disaster recovery.
VIPs – VIPs (virtual IP addresses) are not physical devices. They are the IP addresses to which the world points its browsers to access the services. In the simplest case such an IP address could point to a real webserver. In most of the systems described in this document it will point to a logical service implemented using firewalls, loadbalancers, proxy servers and real servers. A VIP is also the IP address that the site’s DNS name maps to. In this document the example site portal.corp.com is mapped to the VIP 123.123.123.100 by external DNS systems.