Business Continuity & Disaster Recovery
Business Continuity Plan
Summary
The Patients Know Best (PKB) Business Continuity Policy ensures the continuity of critical operations during disruptions, aligning with its Disaster Recovery Plan. It prioritises safeguarding services, systems, and stakeholders, minimising downtime's impact on reputation and finances. Key principles include proactive risk assessment, regular testing, and ensuring operational resilience via multi-cloud environments. Responsibilities span leadership, HR, security, and technical teams, supported by up-to-date training and resources. Recovery objectives focus on mission-critical processes, prioritised through Business Impact Analysis (BIA). PKB maintains robust data recovery, enabling work-from-anywhere operations, with real-time updates on service status provided via a public platform.
Purpose
The Patients Know Best (PKB) Business Continuity Policy (BCP) is focused on maintaining the continuity of services, systems and processes, and on returning to a normal operating state as soon as possible, taking into account the impact of any delay on PKB’s quality of service, reputation and finances. It operates in conjunction with the Disaster Recovery Plan. The key principles of this BCP are as follows:
To take all reasonable steps to avoid any activity that might adversely impact service continuity.
To ensure continuity planning is an intrinsic component of PKB’s functional methodology and operational approach.
To ensure employee, stakeholder, customer and provider information is current and sufficient.
To make advance arrangements for the recovery of service critical components.
To make advance arrangements to relocate or reorganise operations to allow critical processes to continue.
To provide resilience for information systems and data, or alternative ways of working in the event of their failure.
To ensure all systems and processes are in line with PKB's Information Governance and Security Policy.
To protect employees, customers and third-parties where an event is likely to impact their safety.
To test the BCP rigorously, on a regular and prioritised schedule.
To maintain up-to-date BCP training materials and hold regular BCP training sessions.
To update the BCP and DRP regularly and methodically, whether the changes are organisational, procedural, provider-centric, or related to systems or services.
Policy Requirements
Patients Know Best policy requires that:
PKB's BCP and the objectives herein are understood by all stakeholders and employees.
A plan and process for business continuity, including the backup and recovery of systems and data, must be defined and documented.
Employee, provider and system plans are defined to underpin recovery steps in the event of an interruption in service, function and/or core activities.
The Business Continuity Plan shall be simulated and tested at least once a year. Metrics shall be measured and identified recovery enhancements shall be documented to improve the process.
Security controls and requirements must be maintained in two separate Cloud environments, Confluence and Drata.
Roles and Responsibilities
This Policy is maintained by the Patients Know Best Information Governance Teams and SIRO. All executive leadership shall be informed of any and all contingency events.
Line of Succession
The following order of succession ensures that decision-making authority for the Patients Know Best Business Continuity Plan is uninterrupted. The CEO is responsible for ensuring the safety of personnel and the execution of procedures documented within this Plan. The Head of Engineering is responsible for the recovery of Patients Know Best technical environments. If the CEO or Head of Engineering is unable to function as the overall authority or chooses to delegate this responsibility to a successor, the Business Operations Lead shall function as that authority or choose an alternative delegate.
Response Teams and Responsibilities
The following teams have been developed and trained to respond to a contingency event affecting Patients Know Best infrastructure and systems.
HR is tasked with promoting the safety, well-being, and support of all Patients Know Best personnel during a crisis, incident or emergency, recognising the limitations of directly ensuring these for remote workers.
A cross-functional DR (Disaster Recovery) Team is defined with a designated IC (Incident Coordinator) to ensure recovery and security of critical systems and assets.
Each function within PKB, defined as critical to operational stability or service delivery within the BCP, must maintain a register of services, dependencies, suppliers and vendors to ensure the efficacy of the BCP related to their defined functional responsibilities.
All organisations within PKB with responsibility for a critical service must have a defined BCP coordinator responsible for updating registers as an ongoing organisational commitment.
DevOps is responsible for assuring all applications, web services, platforms, and their supporting infrastructure in the Cloud. The team is also responsible for testing re-deployments and assessing damage to the environment. The team leader is the Head of Engineering.
Security is responsible for assessing and responding to all cybersecurity related incidents according to Patients Know Best Incident Response policy and procedures. The security team shall assist the above teams in recovery as needed in non-cybersecurity events. The team leader is the Security Officer.
Members of the above teams must maintain local copies of the contact information of the Business Continuity Plan succession team. Additionally, the team leads must maintain a local copy of this policy in the event Internet access is not available during a disaster scenario.
Policy
Operational Resilience Strategy
Patients Know Best's strategies for operational resilience take a holistic approach to the company and its business process and are developed with consideration of acceptable limits regarding the company's risk appetite and tolerance. These strategies are developed through:
Risk assessment: to identify internal and external threats to the company's ability to conduct business particularly in the areas of technology, human resources, facilities, and third parties;
Vulnerability analysis: to identify weaknesses that could raise the level of operational disruption risk;
Business impact analysis:
to define mission critical business processes, along with the technology, people and facilities that enable them; and,
to assess the potential effects on the company if those processes cannot be performed.
Business Impact Analysis (BIA)
The BIA will determine the criticality of business activities to ensure operational resilience and business continuity during and after a disruption. The BIA will help identify and prioritise system components by correlating them to the business processes that the system supports. It will allow for the characterisation of the impact on the processes if the system becomes unavailable. The BIA has three steps:
Determine business processes and recovery criticality. Business processes supported by the system are identified and the impact of a system disruption to those processes is determined along with outage impacts and estimated downtime. The downtime should reflect the maximum that an organisation can tolerate while still maintaining the mission.
Identify resource requirements. Realistic recovery efforts require a thorough evaluation of the resources required to resume mission/business processes and related interdependencies as quickly as possible. Examples of resources that should be identified include facilities, personnel, equipment, software, data files, system components, and vital records.
Identify recovery priorities for system resources. Based upon the results from the previous activities, system resources can more clearly be linked to critical mission/business processes. Priority levels can be established for sequencing recovery activities and resources.
See Appendix A for the BIA breakdown (AWAITING FEEDBACK).
Work Site Recovery
Patients Know Best’s software development organisation has the ability to work from any location with Internet access and does not require an office provided Internet connection.
Application Service Event Recovery
Patients Know Best maintains a status page to provide real time updates and inform customers of the status of each service. The status page is updated with details about an event that may cause service interruption / downtime. Patients Know Best’s status page:
APPENDIX A
Business Impact Analysis
Outage Impacts
Impact categories and values characterise levels of severity to the company that would result for that particular impact category, if the business process could not be performed. These impact categories and values are samples and should be revised to reflect what is appropriate for the organisation.
Outage Impact Category Key
Category | Function type | Continuity requirement |
---|---|---|
Cat 1 | Critical functions | Must be continued at normal or increased service levels. |
Cat 2 | Essential functions | Must be continued if possible, even if it's in a reduced capacity. |
Cat 3 | Necessary functions | Can be paused if necessary, but must be resumed within 30 days or sooner. |
Cat 4 | Desirable functions | Can be paused and resumed when conditions allow. |
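As an illustration only, the category key could be used to order recovery work, with the most severe category handled first. The following is a minimal sketch: the class names, the `recovery_order` helper and the example process names are all hypothetical and are not part of PKB's BIA.

```python
from dataclasses import dataclass
from enum import IntEnum


class ImpactCategory(IntEnum):
    """Outage impact categories from the key above (1 is the most severe)."""
    CRITICAL = 1
    ESSENTIAL = 2
    NECESSARY = 3
    DESIRABLE = 4


@dataclass(frozen=True)
class BusinessProcess:
    name: str  # hypothetical process name, for illustration only
    category: ImpactCategory


def recovery_order(processes):
    """Sort processes so the most severe category comes first, then by name."""
    return sorted(processes, key=lambda p: (p.category, p.name))


# Hypothetical example data, not taken from the BIA.
processes = [
    BusinessProcess("internal wiki", ImpactCategory.DESIRABLE),
    BusinessProcess("patient record access", ImpactCategory.CRITICAL),
    BusinessProcess("monthly reporting", ImpactCategory.NECESSARY),
]
ordered = [p.name for p in recovery_order(processes)]
print(ordered)
```

Sorting on the numeric category keeps the ordering rule in one place, so revising the categories for the organisation does not require changes elsewhere.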
Estimated Downtime
Downtime factors resulting from a disruptive event will be estimated by working directly with business process owners, departmental staff, managers, and other stakeholders. The following downtime categories will be considered:
Maximum Tolerable Downtime (MTD). The MTD represents the total amount of time managers are willing to accept for a business process outage or disruption and includes all impact considerations. Determining MTD is important because failing to do so could leave continuity planners with imprecise direction on:
Selection of an appropriate recovery method; and
The depth of detail which will be required when developing recovery procedures, including their scope and content.
Recovery Time Objective (RTO). RTO defines the maximum amount of time that a system resource can remain unavailable before there is an unacceptable impact on other system resources, supported business processes, and the MTD. Determining the information system resource RTO is important for selecting appropriate technologies that are best suited for meeting the MTD.
Recovery Point Objective (RPO). The RPO represents the point in time, prior to a disruption or system outage, to which business process data must be recovered (given the most recent backup copy of the data) after an outage.
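The relationship between these three values can be checked mechanically. The sketch below assumes that the RPO is expressed as a maximum acceptable data-loss window and that backups run at a fixed interval; both are simplifying assumptions for illustration, and the names `RecoveryObjectives` and `check_objectives` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    mtd: timedelta  # Maximum Tolerable Downtime for the business process
    rto: timedelta  # Recovery Time Objective for the supporting system resource
    rpo: timedelta  # RPO, assumed here to be a maximum data-loss window


def check_objectives(obj, backup_interval):
    """Return a list of problems found; an empty list means the values are consistent."""
    problems = []
    if obj.rto > obj.mtd:
        # A resource that takes longer to recover than the process can tolerate.
        problems.append("RTO exceeds MTD")
    if backup_interval > obj.rpo:
        # The newest backup could be older than the acceptable data-loss window.
        problems.append("backup interval exceeds RPO")
    return problems


# Hypothetical values, for illustration only.
objectives = RecoveryObjectives(
    mtd=timedelta(hours=24), rto=timedelta(hours=4), rpo=timedelta(hours=1)
)
print(check_objectives(objectives, backup_interval=timedelta(hours=24)))
print(check_objectives(objectives, backup_interval=timedelta(minutes=5)))
```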
STEP 3. Identify Recovery Priorities for System Resources
Disaster Recovery Plan
Summary
This Disaster Recovery Policy ensures Patients Know Best (PKB) can recover from disruptions effectively. PKB operates on Google Cloud Platform, with encrypted nightly backups and continuous data archiving. The Disaster Recovery Plan (DRP) defines communication protocols, critical and non-critical systems, and recovery procedures, including notification, recovery, and reconstitution phases. Testing, maintenance, and regular audits validate the DRP's robustness. The policy outlines response strategies for various disasters, prioritising swift restoration of critical functions. Recovery procedures involve automated environment replication, testing, and DNS updates to minimise downtime and restore operations within 24 hours, ensuring continuity in delivering essential patient care services.
Purpose
This policy establishes procedures to recover Patients Know Best following a disruption resulting from a disaster. This Disaster Recovery Policy is maintained by the Information Governance Team. The PKB Disaster Recovery Team (‘DR Team’) will implement a recovery strategy based on the severity and nature of the incident. PKB has no dependence on a traditional physical office facility for the successful operation of business services because all PKB staff work in a distributed way rather than in a central office.
The PKB infrastructure is hosted by Google Cloud Platform (GCP). The GCP full compliance statement is available here: Cloud compliance and regulations resources
Scope
The Disaster Recovery policy includes the formation and maintenance of a Disaster Recovery Plan (DRP), which outlines specific recovery initiatives, workflows and technical steps. The DRP includes primary and secondary emergency contact information for all named staff, alongside secondary individuals in case of absence or unavailability. Primary, secondary and tertiary communications channels are defined across different communication mediums, so as to lessen dependence on one medium. The DRP is tested for efficacy on a regular basis.
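The idea of primary, secondary and tertiary channels can be sketched as a simple fallback loop. This is an illustrative sketch only: the channel names and sender functions below are hypothetical stand-ins, and the DRP itself defines the real channels.

```python
def notify_with_fallback(message, channels):
    """Try each (name, send) pair in priority order.

    Returns the name of the first channel whose send function reports
    success, or None if every channel fails.
    """
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # A broad catch is deliberate in this sketch: any failure on one
            # medium should move on to the next rather than stop notification.
            continue
    return None


# Hypothetical senders that simulate an outage of the primary channel.
def primary_send(message):
    raise ConnectionError("primary channel unavailable")


def secondary_send(message):
    return True


def tertiary_send(message):
    return True


channels = [
    ("primary", primary_send),
    ("secondary", secondary_send),
    ("tertiary", tertiary_send),
]
used = notify_with_fallback("DR plan activated", channels)
print(used)
```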
The following objectives have been established for this plan:
Maximise the effectiveness of contingency operations through an established plan that consists of the following phases:
Notification/Activation phase to detect and assess damage and to activate the plan.
Recovery phase to restore temporary operations and recover damage done to the original system.
Reconstitution phase to restore system processing capabilities to normal operations.
Identify the activities, resources, and procedures needed to carry out Patients Know Best processing requirements during prolonged interruptions to normal operations.
Identify and define the impact of interruptions to Patients Know Best systems.
Assign responsibilities to designated personnel and provide guidance for recovering Patients Know Best systems during prolonged periods of interruption to normal operations.
Ensure coordination with other Patients Know Best staff who will participate in the Disaster Recovery Planning strategies.
Policy
Examples of the types of disasters that would initiate this plan are natural disasters, political disturbances, man-made disasters, external human threats, and internal malicious activities.
Patients Know Best defines two categories of systems from a disaster recovery perspective:
Critical Systems. These systems host application servers and database servers or are required for functioning of systems that host application servers and database servers. These systems, if unavailable, affect the integrity of data and must be restored, or have a process begun to restore them, immediately upon becoming unavailable.
Non-critical Systems. These are all systems not considered critical by the definition above. These systems, while they may affect the performance and overall security of critical systems, do not prevent Critical systems from functioning and being accessed appropriately. These systems are restored at a lower priority than critical systems.
Threat and Risk Assessment and Management
There are many potential disruptive threats which can occur at any time and affect the normal business process. We have considered a wide range of potential threats and the results of our deliberations are included in this section. Each potential environmental disaster or emergency situation has been examined. The focus here is on the level of business disruption which could arise from each type of disaster.
The Patients Know Best IT Risk Assessment documents a full detailed assessment of threats.
Testing and Maintenance
The Senior Responsible Officer (SRO) shall establish criteria for validation/testing of a Disaster Recovery Plan, an annual test schedule, and ensure implementation of the test. This process will also serve as training for personnel involved in the plan's execution. At a minimum, the Disaster Recovery Plan shall be tested annually. The types of validation/testing exercises include tabletop and technical testing.
Patient Data Backups
The entire fully encrypted patient dataset is backed up nightly. PKB also continuously archives changes to this dataset. In case of a total system failure (for example, destroyed database servers), data can be recovered up to the minute before the failure.
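Combining a nightly backup with a continuous archive of changes generally means restoring the most recent backup taken before the failure and then replaying the archived changes up to the chosen recovery point. The sketch below illustrates that selection logic only; it is not PKB's restore tooling, and the function name and timestamps are hypothetical.

```python
from datetime import datetime


def plan_restore(backup_times, change_times, target):
    """Pick the base backup and the archived changes needed to reach `target`.

    Returns (base, replay): the latest backup at or before `target`, and the
    archived change timestamps after that backup and up to `target`, in order.
    """
    candidates = [t for t in backup_times if t <= target]
    if not candidates:
        raise ValueError("no backup exists at or before the target time")
    base = max(candidates)
    replay = sorted(t for t in change_times if base < t <= target)
    return base, replay


# Hypothetical timestamps, for illustration only.
backups = [datetime(2024, 10, 7, 2, 0), datetime(2024, 10, 8, 2, 0)]
changes = [
    datetime(2024, 10, 7, 9, 30),
    datetime(2024, 10, 8, 8, 15),
    datetime(2024, 10, 8, 11, 59),
    datetime(2024, 10, 8, 12, 5),  # after the recovery point, so not replayed
]
base, replay = plan_restore(backups, changes, target=datetime(2024, 10, 8, 12, 0))
print(base, replay)
```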
Auditing
Frequent internal audits are conducted of the business continuity and disaster recovery program. These are performed as part of our biannual IG review process.
Tabletop Testing
The primary objective of the tabletop test is to ensure designated personnel are knowledgeable and capable of performing the notification/activation requirements and procedures as outlined in the Disaster Recovery Plan, in a timely manner. The exercises include, but are not limited to:
Testing to validate the ability to respond to a crisis in a coordinated, timely, and effective manner, by simulating the occurrence of a specific crisis.
Technical Testing
The primary objective of the technical test is to ensure the communication processes and data storage and recovery processes can function at an alternate site to perform the functions and capabilities of the system within the designated requirements. Technical testing shall include, but is not limited to:
Restore system using backups
Cloud configuration
Network recovery
Disaster Recovery Procedures
Notification and Activation Phase
This phase addresses the initial actions taken to detect and assess damage inflicted by a disruption to Patients Know Best. Based on the assessment of the event, and where applicable in accordance with the Patients Know Best Incident Response Policy, the Disaster Recovery Plan may be activated by the SRO/CTO.
Notification Sequence
The Incident Response Group (IRG) through the Incident Coordinator (IC) will notify the Senior Leadership Team (SLT).
The CTO is to contact the rest of the Security team and inform them of the event. The Security Team begins assessment procedures to determine the extent of damage and estimated recovery time. If damage assessment cannot be performed locally because of unsafe conditions, the CTO is to follow the steps below.
Patients Know Best will notify customers of a business disruption, provide details of the recovery progress, and advise customers of any necessary interim arrangements to contact the company. The details of our business recovery plan are considered confidential and regarded as proprietary materials. However, we will be happy to address any specific questions or issues to assure confidence in PKB’s business continuity capabilities.
Damage Assessment
The CTO is to logically assess damage, gain insight into whether the infrastructure is salvageable, and begin to formulate a plan for recovery.
Alternate Assessment
Upon notification, the CTO is to follow the procedures for damage assessment with combined DevOps and Web Services Teams.
The Patients Know Best Disaster Recovery Plan is to be activated immediately in the event that part of the PKB infrastructure is compromised.
If the plan is to be activated, the CTO is to notify and inform team members of the details of the event.
Upon notification from the CTO, group leaders and managers are to notify their respective teams. Team members are to be informed of all applicable information and prepared to respond if necessary.
The IRG / DR Team is to notify remaining personnel and executive leadership on the general status of the incident.
Notification will be delivered by designated communication channels.
Worst Case RTO for Critical Functions
Critical Function | RTO |
---|---|
Provide access to patient data via web portal. | 4 - 24 hrs |
Provide access to patient data via API. | 1 - 12 hrs |
Provide access to support function - ticketing system is offline. | 30 mins |
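During an incident, the elapsed outage time can be compared against these worst-case values. The sketch below takes the upper bound of each range in the table as the limit; the dictionary keys and the `rto_status` helper are hypothetical names introduced for illustration.

```python
from datetime import timedelta

# Upper bounds of the RTO ranges in the table above. The keys are
# hypothetical short names, not identifiers used by PKB.
WORST_CASE_RTO = {
    "web_portal": timedelta(hours=24),
    "api": timedelta(hours=12),
    "support": timedelta(minutes=30),
}


def rto_status(function, outage_elapsed):
    """Return ("within", time remaining) or ("breached", zero) for a function."""
    remaining = WORST_CASE_RTO[function] - outage_elapsed
    if remaining < timedelta(0):
        return ("breached", timedelta(0))
    return ("within", remaining)


print(rto_status("support", timedelta(minutes=10)))
print(rto_status("api", timedelta(hours=13)))
```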
Since it is impossible to anticipate every type of potential disaster, there can be no assurance that there will be no interruption of the PKB business functions in all circumstances. However, PKB is committed to rigour and robustness in our approach and planning with regard to our Business Continuity Program.
Recovery Phase
The following procedures are for recovering the Patients Know Best infrastructure. Procedures are outlined per team required. Each procedure should be executed in the sequence it is presented to maintain efficient operations.
Recovery Goal
The goal is to rebuild Patients Know Best infrastructure to a production state. The tasks outlined below are not sequential and some can be run in parallel.
Contact Partners and Customers affected.
Assess damage to the infrastructure/technical environment.
Begin replication of new environments using automated and tested scripts. At this point it is determined whether to recover in Rackspace, AWS, GCP, Heroku, Azure, or another cloud environment.
Test a new environment using pre-written tests.
Test logging, security, and alerting functionality.
Assure systems are appropriately patched and up to date.
Deploy technical environment to production.
Update DNS to point to the new environment.
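Because some of these tasks can run in parallel, one way to reason about them is as a dependency graph that is resolved into waves of tasks. The sketch below uses Python's standard `graphlib` module. The dependencies shown are assumptions made for illustration; this plan does not state which tasks depend on which, and the task identifiers are hypothetical.

```python
from graphlib import TopologicalSorter

# Maps each task to the tasks assumed to finish before it can start.
# These dependencies are illustrative assumptions, not part of the plan.
DEPENDS_ON = {
    "contact_partners_and_customers": set(),
    "assess_damage": set(),
    "replicate_environment": {"assess_damage"},
    "run_environment_tests": {"replicate_environment"},
    "test_logging_security_alerting": {"replicate_environment"},
    "verify_patching": {"replicate_environment"},
    "deploy_to_production": {
        "run_environment_tests",
        "test_logging_security_alerting",
        "verify_patching",
    },
    "update_dns": {"deploy_to_production"},
}


def parallel_waves(depends_on):
    """Group tasks into waves; tasks within a wave have no unmet dependencies."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = parallel_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Under these assumed dependencies, contacting partners and assessing damage can begin together, and the three verification tasks can run side by side once the new environment exists.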
Reconstitution Phase
This section discusses activities necessary for restoring Patients Know Best operations. The goal is to restore full operations within 24 hours of a disaster or outage.
Original or New Site Restoration
Begin replication of new environment using automated and tested scripts (DevOps)
Test new environment using pre-written tests (Web Services)
Test logging, security, and alerting functionality (DevOps)
Deploy environment to production (Web Services)
Assure systems are appropriately patched and up-to-date (DevOps)
Update DNS to new environment (DevOps)
Approval and review
PKB Business Continuity Plan v2.3 was approved on the 9th October 2024.
© Patients Know Best, Ltd. Registered in England and Wales Number: 6517382. VAT Number: GB 944 9739 67.