Doing the Work Right in Data Centers With Checklists
Data centers are complex. Modern economies rely upon their continuous operation. IoT solutions paired with this data center checklist can help!
Before we dive into how to manage complexity in data centers, let’s take a step back and discuss the importance of checklists in such environments.
PGE has a great video about using checklists that references the medical profession, aviation, and nuclear power stations. Here is the video:
Borrowing from the theme of the video, we will extend it to the use of checklists in a data center and to applying the Internet of Things (IoT) to the task.
A History of Why Checklists Matter
A checklist compensates for the weaknesses of human memory and helps ensure consistency and completeness in carrying out tasks. Checklists came into prominence with pilots: the pilot’s checklist was first developed and used in 1935, after a serious accident hampered the adoption into the armed forces of a new aircraft (the predecessor to the famous Flying Fortress). The pilots sat down and put their heads together. What was needed was some way of making sure that everything was done and that nothing was overlooked.
The result was a pilot’s checklist. Four checklists were developed: take-off, flight, before landing, and after landing. The new aircraft was not “too much aeroplane for one man to fly”; it was simply too complex for any one man’s memory. These checklists for the pilot and co-pilot made sure that nothing was forgotten.
Data Center Complexity
Data centers are complex entities. They are too much technology for one man to operate. They consist of multiple functions that combine to allow a data center to operate and provide services. These functions are:
- The data center white space, where the information technology (IT) kit is located. The white space often includes a raised floor, which plays an important part in the cooling function described below.
- The power supply, which consists of utility power, standby generators, and uninterruptible power supplies (UPS). The standby generators require fuel management and the UPS units need battery strings. Power is typically distributed internally by power distribution units (PDUs) connected to a busway system.
- Cooling, whose primary function is to provide thermal management for the white space. The cooling systems consume power, and the rate at which this power is consumed is one of the important efficiency metrics for a data center. The remaining power, which is used in the white space, is often known as the critical compute load.
- The data center site location and building shell. These are mostly building components, but an important consideration is the physical security of the whole data center, which includes site access control and surveillance.
- Fire detection and suppression systems. The suppression systems are typically gas-based. It is also important that the various areas of the data center are segmented and that the walls and doors between these areas are fireproof.
- Telecommunications and networking. Data centers are permanently connected to the Internet, so telecommunications and data center networks are integral functions. The components include data center networks (often using a spine-and-leaf architecture), the telco meet-me room, conduits, and access manholes.
All these functions need regular maintenance and upkeep, and one of the operational tools is the checklist, used in the same manner as described above for medicine, aviation, and nuclear power.
Basic Data Center Checklist
The following is a rudimentary checklist example associated with power.
Each checklist item's rating and weight are typically given on a scale from 1 to 5, and together they yield a score for the function. This score is then evaluated and categorized as follows:
- Satisfactory: Components evaluated as adequate, appropriate and effective to provide reasonable assurance that data center risks are being managed.
- Low priority: A few specific component weaknesses were noted. Components evaluated as adequate, appropriate and effective to provide reasonable assurance that data center risks are being managed.
- Moderate: Numerous weaknesses were noted. Components evaluated are unlikely to provide assurance that data center risks are being managed.
- Critical: Components evaluated are inadequate, inappropriate or ineffective to provide reasonable assurance that data center risks are being managed.
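As a rough illustration of how such a score might be computed, the sketch below multiplies each item's rating by its weight and expresses the total as a fraction of the maximum achievable. The item names, the assumption that a higher rating is better, and the category cut-offs are all hypothetical; real values would come from the site's own assessment standard.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str    # hypothetical item description
    rating: int  # 1 (poor) to 5 (good); the direction of the scale is an assumption
    weight: int  # 1 (minor) to 5 (most important to the function)


# Hypothetical cut-offs, as fractions of the maximum score.
THRESHOLDS = [
    (0.90, "Satisfactory"),
    (0.75, "Low priority"),
    (0.50, "Moderate"),
]


def score(items):
    """Return the weighted score as a fraction of the maximum achievable."""
    if not items:
        raise ValueError("at least one checklist item is required")
    for item in items:
        if not (1 <= item.rating <= 5 and 1 <= item.weight <= 5):
            raise ValueError(f"rating and weight must be 1-5: {item.name}")
    achieved = sum(item.rating * item.weight for item in items)
    maximum = sum(5 * item.weight for item in items)
    return achieved / maximum


def categorize(fraction):
    """Map a score fraction to one of the four categories listed above."""
    for cutoff, label in THRESHOLDS:
        if fraction >= cutoff:
            return label
    return "Critical"


# Made-up power checklist entries, for illustration only.
power_items = [
    ChecklistItem("Generator fuel level checked", rating=5, weight=4),
    ChecklistItem("UPS battery strings inspected", rating=3, weight=5),
    ChecklistItem("PDU load within limits", rating=4, weight=3),
]
power_score = score(power_items)
print(f"{power_score:.2f} -> {categorize(power_score)}")
```

With these made-up entries the weighted total is 47 out of a possible 60, which falls into the "Low priority" band under the assumed cut-offs.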
Integrating IoT into the Data Center
Each of the data center components can be monitored by IoT sensors. These sensors provide metrics that an IoT platform uses to determine whether the data center is operating within suitable parameters. This data becomes the core input to the data center checklists, which need to be completed daily.
Many data centers rely on legacy SCADA systems for metrics, and there is a definite requirement to refresh these systems with next-generation IoT-based devices and platforms.
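A minimal sketch of how sensor metrics might feed a daily checklist is shown here. The metric names, the acceptable ranges, and the shape of the readings are assumptions made for illustration; they do not describe any particular IoT platform.

```python
# Hypothetical acceptable ranges per metric; real limits depend on the site.
LIMITS = {
    "cold_aisle_temp_c": (18.0, 27.0),
    "ups_battery_charge_pct": (90.0, 100.0),
    "generator_fuel_pct": (75.0, 100.0),
}


def evaluate_readings(readings, limits=LIMITS):
    """Turn the latest sensor readings into checklist entries."""
    results = {}
    for metric, (low, high) in limits.items():
        value = readings.get(metric)
        if value is None:
            # No reading available: flag the item for a manual check.
            results[metric] = "NO DATA"
        elif low <= value <= high:
            results[metric] = "PASS"
        else:
            results[metric] = "FAIL"
    return results


# Made-up readings, as they might arrive from an IoT platform.
latest = {"cold_aisle_temp_c": 29.5, "ups_battery_charge_pct": 98.0}
checklist = evaluate_readings(latest)
for metric, status in checklist.items():
    print(f"{metric}: {status}")
```

Treating a missing reading as its own state, rather than as a pass or a fail, keeps a silent sensor from being mistaken for a healthy component.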
The Adoption of Wearables
At present, many data centers rely on checklists being executed by data center engineers using clipboards. Additionally, metrics are most often visualized and presented in Network Operations Centers (NOC) or Security Operations Centers (SOC).
Personally, as a professional working in data center environments, I have spent countless hours completing assessments and checklists in a clumsy manner using Excel spreadsheets on a laptop. Besides not being automated, this approach makes it difficult, if not nearly impossible, to aggregate the collected data over time and construct high-level reports.
The introduction of IoT-based wearables allows data center engineers to carry out operations hands-free while being presented with information and metrics in a heads-up display. These wearables allow:
- Displaying and executing the checklist items via an integrated application in the heads-up display;
- Recording of data center components using video or pictures;
- Oversight of engineers by skilled and experienced managers who can remotely engage and view operations using either Webex or Skype;
- Accessing data center equipment manuals via voice commands for display in the heads-up display; and
- Direct viewing in the heads-up display of the component’s metrics obtained from the IoT platform.
Improved Operations
The impact of using IoT in a data center as described is that when a checklist is executed and the result is evaluated, corrective action can be triggered immediately. This minimizes human error, and the result is improved data center operations, which can be measured via improved availability statistics.