Service checklist - Meerkat

(Also includes removal of old service/system at the end)

Name / Description for System/Service Being Added (or, if indicated, removed).

 

Monitoring - Icinga tests in place for

Host checks

 

Service Checks – Fabric layer

 

 

Monitoring - Call-out

Should the service call-out?

 

Is documentation provided on the call-outs?

 

Monitoring – Grafana

Grafana/InfluxDB monitoring needed/in-place?

 

Documentation & Support

Is there any description of the system and recovery procedures?

 

Do at least two people know enough about the system to resolve issues in the absence of one of those responsible for it?

 

Does the system need an entry in the GOC DB?

 

 

Quattor

Have all the relevant Quattor templates been independently reviewed?

 

Set-up & Security

Is IPv4/IPv6 dual-stack enabled? (If not, please justify why IPv6 cannot be enabled).

Does the system have the appropriate name?

 

 

Are any DNS entries (e.g. aliases) needed?

 

 

Has an audit been done of security requirements for the service/system?

 

Have any resulting security issues been addressed (e.g. configuring iptables, restricting certain access)?

 

Is a firewall hole needed / enabled?

 

Have you checked there are no unexpected firewall holes left from a previous system at this IP address?

 

If the service requires passwords: Have these been set up and noted securely?

 

Is there a process to monitor for password expiry and/or update as required?

 

Are log rotation and copying to central loggers configured for all appropriate log files?

 

Is remote console access set up (IPMI or VM solution)?

 

Is Pakiti set up on the machine?

 

E-mail: Is the machine configured to send mail to csf-mail.rl.ac.uk?

 

Architecture System Configuration

Is the service run on a system that is powerful enough?

 

Do all the disk partitions have sufficient space (e.g. allowing for log files to grow when the system is busy)?

 

Is the service run on a system that has disk resilience if needed?

 

Does it need UPS and/or a dual power supply?

 

Is ACPI enabled so that the system can be powered down over IPMI?

 

If the service is on multiple servers, are these (where possible and appropriate) placed on different:

Network switches?

Power phases?

PDUs?

Service is currently running on a single VM within the Cloud.

Additional Questions for Virtual Machines

System requirements appropriate for VM

OK for brief outage

 

Does not require persistent storage

No excessive I/O etc.

 

Configuration

Correct option set for action on hypervisor restart (N/A for VMware VMs).

 

 

System requirements (e.g. RAM, number of CPUs) are documented so that if a new VM has to be set up the relevant parameters are known.

 

Check that the system has VMware Tools installed.

 

Check naming of the VM and its ownership in the hypervisor is appropriate.

 

 

Live migration tested?

 

Verified a second person can re-instance the server from scratch.

Architecture Resilience

Multiple instances needed?

 

Automatic Fail-over / hot standby needed?

 

Fail-over to equipment in Atlas building needed/configured?

 

Backup

Is a backup needed, and if so at what frequency?

 

What backup system is in use (Amanda/Atlasbackup)?

 

Has the backup been tested?

Communications

Has the VO been informed (if appropriate)?

 

Procedures

Has a Change Control Request been put in for the new service?

 

Are there any changes to standard procedures that result? (E.g. POD/POC needing to be informed.)

Clean-up of old systems

Is this a replacement for a service / system that can be retired?

 

Final Checks

Has the system been rebooted in its final configuration to ensure all services start up OK?

 

Does the machine have the latest kernel and other relevant updates?

 

Old System Removal Check-list.

Closely linked to the setting up of a new system are tasks relating to the removal of an old one.

Clean-up of old systems:

Old Rundeck system has been removed

Removal of Nagios checks (service and fabric layer) and clean-up of event handlers.

Removal from Cacti & Ganglia.

 

Check if it’s a Ganglia collector node.

 

Does any documentation need updating to replace or remove the ‘old’ system/service? E.g. removal from Call-out docs.

Have any appropriate firewall holes been closed?

Are any DNS entries (e.g. aliases) no longer needed?

Have VOs been informed (if appropriate)?

 

Has a Change Control Request been put in for the removal of the service (if appropriate)?

 

Does the system need to be removed from the GOC DB?

 

Final steps

Service owner stops their services from starting up on the server (chkconfig off) and sets this in Quattor if appropriate.

 

Request Fabric to power off (in rack).

 

Once an appropriate time has elapsed, hand back to Fabric for decommissioning.

 
