Service checklist - Meerkat
(Also includes removal of old service/system at the end)
Name / Description for System/Service Being Added (or, if indicated, removed). |
|
Monitoring - Icinga tests in place for
Host checks | |
|
Service Checks – Fabric layer | |
|
Monitoring - Call-out
Should the service call-out? | |
|
Is documentation provided on the call-outs? | |
|
Monitoring – Grafana
Grafana/InfluxDB monitoring needed/in-place? | |
|
Documentation & Support
Is there any description of the system and recovery procedures? | |
|
Do at least two people know enough about the system to resolve issues in the absence of one of those responsible for it? | |
|
Does the system need an entry in the GOC DB? | |
|
Quattor
Have all the relevant Quattor templates been independently reviewed? | |
|
Set-up & Security
Is IPv4/IPv6 dual-stack enabled? (If not, please justify why IPv6 cannot be enabled). |
Does the system have the appropriate name? | |
|
|
Are any DNS entries (e.g. aliases) needed? | |
|
|
Has an audit been done of security requirements for the service/system? | |
|
Have any resulting security issues been addressed (e.g. configuring iptables, restricting certain access)? | |
|
Is a firewall hole needed / enabled? | |
|
Have you checked there are no unexpected firewall holes left from a previous system at this IP address? | |
|
If the service requires passwords: Have these been set-up and noted securely? | |
|
Is there a process to monitor for password expiry and/or update as required? | |
|
Are log rotation and copying to central loggers configured for all appropriate log files? | |
|
Is remote console access set up (IPMI or VM solution)? | |
|
Is Pakiti set-up on the machine? | |
|
E-mail: Is the machine configured to send mail to csf-mail.rl.ac.uk? | |
|
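The dual-stack question above can be partly checked from DNS. The following is a minimal sketch, not a site procedure: it only reports whether a name resolves to both IPv4 and IPv6 addresses, and does not prove that the interfaces or firewall are configured for both. The function names and the example hostname are hypothetical.

```python
import socket


def families_from_addrinfo(infos):
    """Return the address-family labels found in getaddrinfo() results."""
    labels = set()
    for family, _type, _proto, _canonname, _sockaddr in infos:
        if family == socket.AF_INET:
            labels.add("IPv4")
        elif family == socket.AF_INET6:
            labels.add("IPv6")
    return labels


def is_dual_stack(hostname):
    """True if the name resolves to at least one IPv4 and one IPv6 address."""
    infos = socket.getaddrinfo(hostname, None)
    return families_from_addrinfo(infos) == {"IPv4", "IPv6"}


if __name__ == "__main__":
    # Hypothetical hostname; replace with the system being checked.
    host = "myservice.example.org"
    try:
        print(host, "dual-stack in DNS:", is_dual_stack(host))
    except socket.gaierror as err:
        print(host, "did not resolve:", err)
```

A name with only A records would return False here, which is the case the checklist asks to be justified.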
Architecture System Configuration
Is the service run on a system that is powerful enough? | |
|
Do all the disk partitions have sufficient space (e.g. allowing for log files to grow when the system is busy)? | |
|
Is the service run on a system that has disk resilience if needed? | |
|
Does it need UPS and/or a dual power supply? | |
|
Is ACPI enabled so that the system can be powered down over IPMI? | |
|
If the service is on multiple servers, are these (if possible or appropriate) placed on different: | Service is currently running on a single VM within the Cloud
Network switches? |
Power phases |
PDUs |
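The disk-space question in this section can be spot-checked with a short script. This is a sketch under stated assumptions: the 20% free-space threshold and the list of mount points are illustrative values, not site policy, and a one-off reading says nothing about how fast logs grow under load.

```python
import shutil


def low_space_partitions(mount_points, min_free_fraction=0.2):
    """Return {mount_point: free_fraction} for partitions below the threshold."""
    low = {}
    for mount in mount_points:
        usage = shutil.disk_usage(mount)  # named tuple: total, used, free (bytes)
        if usage.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            low[mount] = round(free_fraction, 3)
    return low


if __name__ == "__main__":
    # Illustrative mount points; adjust to the partitions on the system.
    for mount, fraction in low_space_partitions(["/"]).items():
        print(f"{mount}: only {fraction:.1%} free")
```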
Additional Questions for Virtual Machines
System requirements appropriate for VM
OK for brief outage | |
|
Does not require persistent storage |
No excessive I/O etc. | |
|
Configuration
Correct option set for action on hypervisor restart (N/A for VMware VMs). | |
|
|
System requirements (e.g. RAM, Number of CPUs) are documented so that if a new VM has to be set-up the relevant parameters are known. | |
|
Check the system has the VMware tools installed. | |
|
Check naming of the VM and its ownership in the hypervisor is appropriate. | |
|
|
Live migration tested? | |
|
Verified a second person can re-instance the server from scratch. |
Architecture Resilience
Multiple instances needed? | |
|
Automatic Fail-over / hot standby needed? | |
|
Fail-over to equipment in Atlas building needed/configured? | |
|
Backup
Is a backup needed, and if so at what frequency? | |
|
What Backup system is in use (Amanda/Atlasbackup)? | |
| |
Has the backup been tested? |
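One way to answer "Has the backup been tested?" is to restore into a scratch directory and compare checksums against the live files. The sketch below assumes the restore itself has already been done with whichever backup system is in use; it does not call Amanda or Atlasbackup. The function names and directory arguments are hypothetical, and files that changed after the backup was taken will be reported as differences.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_mismatches(original_dir, restored_dir):
    """List relative paths whose restored copy is missing or differs."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(source) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

An empty list from `restore_mismatches` means every file under the original directory was found in the restored copy with identical content.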
Communications
Has the VO been informed (if appropriate)? | |
|
Procedures
Has a Change Control Request been put in for the new service? | |
|
Are there any changes to standard procedures that result? (E.g. POD/POC needing to be informed.) |
Clean-up of old systems
Is this a replacement for a service / system that can be retired? | |
|
Final Checks
Has the system been rebooted in its final configuration to ensure all services start up OK? | |
|
Does the machine have the latest kernel and other relevant updates? | |
|
Old System Removal Check-list.
Closely linked to the setting up of a new system are tasks relating to the removal of an old one.
Clean-up of old systems:
Old Rundeck system has been removed
Removal of Nagios checks (service and fabric layer) and clean up of event handlers. |
Removal from CACTI & Ganglia. | |
|
Check if it’s a Ganglia collector node. | |
|
Does any documentation need updating to replace or remove the ‘old’ system/service? E.g. removal from Call-out docs. |
Have any appropriate firewall holes been closed? |
Are any DNS entries (e.g. aliases) no longer needed? |
Have VOs been informed (if appropriate)? | |
|
Has a Change Control Request been put in for the removal of the service (if appropriate)? | |
|
Does the system need to be removed from the GOC DB? | |
|
Final steps
Service owner stops their services from starting up on the server (chkconfig off) and sets this in Quattor if appropriate. | |
|
Request Fabric to power off (in rack). | |
|
Once an appropriate time has elapsed, hand back to Fabric for decommissioning. | |
|