By Oscar Morales
The following is the incident report for the Apache server outage that occurred on December 09, 2020.
- Duration: From 09/12/2020 12:00:00 AM to 09/12/2020 8:06:00 AM (GMT-5)
- Users impacted: 100%
- Root cause: The apache2 service was not running
The issue was detected on the morning of 09/12/2020. Users reported that they could not access the website and were receiving a 500 HTTP response status code.
Timeline (all times in GMT-5)
- 8:00 AM: Users report the issue
- 8:02 AM: Server debugging starts
- 8:03 AM: Connection to the website is made to check the HTTP response status code (500 is received)
- 8:04 AM: Verification of the apache2 service starts
- 8:05 AM: The apache2 service is started
- 8:06 AM: Connection to the website is made to check the HTTP response status code (200 is received)
At 7:50 AM GMT-5, the apache2 service was stopped so that some changes could be applied to the Apache server configuration, but it was never started again. This caused a server error when users tried to access the website.
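One way to make this kind of maintenance safer is to tie the restart to the stop, so the service is started again even if the configuration step fails or the restart is forgotten. The following is a minimal Python sketch, not the procedure that was used during the incident. The `service_stopped` helper, the `run` function, and the use of the `service` command are assumptions made for illustration.

```python
import contextlib
import subprocess
from typing import Callable, Iterator, Sequence

# A runner executes a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (assumes the `service` tool exists)."""
    return subprocess.run(list(cmd), check=False).returncode


@contextlib.contextmanager
def service_stopped(service: str, runner: Runner = run) -> Iterator[None]:
    """Stop `service` for the duration of the block, then always start it again."""
    runner(["service", service, "stop"])
    try:
        yield
    finally:
        # Runs even if the configuration change raises an exception.
        runner(["service", service, "start"])
```

Wrapping the configuration step in `with service_stopped("apache2"):` would mean the start command is issued whenever the block exits, whether it succeeds or fails.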
Resolution and recovery
At 8:00 AM GMT-5, users reported the issue and debugging began quickly. By 8:03 AM, the HTTP response status code had shown that the issue was on the server side.
At 8:04 AM, an SSH connection was made to the container that runs the server to check its status, which showed that the apache2 service was not running. The apache2 service was then started.
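The check-then-start step described here could be scripted roughly as follows. This is a sketch under assumptions: the report does not state which commands were run, so the `service apache2 status` and `service apache2 start` invocations, as well as the `ensure_running` and `run` names, are illustrative only. It relies on the status command returning a non-zero exit code when the service is not running.

```python
import subprocess
from typing import Callable, Sequence

# A runner executes a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (assumes the `service` tool exists)."""
    return subprocess.run(list(cmd), check=False).returncode


def ensure_running(service: str, runner: Runner = run) -> bool:
    """Start `service` if its status check fails.

    Returns True if a start was needed, False if it was already running.
    """
    if runner(["service", service, "status"]) == 0:
        return False  # already running, nothing to do
    if runner(["service", service, "start"]) != 0:
        raise RuntimeError(f"could not start {service}")
    return True
```

Calling `ensure_running("apache2")` inside the container would only issue the start command when the status check reports a problem.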
At 8:06 AM, a connection to the website returned a 200 HTTP response status code, confirming that access had been restored for users.
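The status code checks made at 8:03 AM and 8:06 AM can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library. The function names are hypothetical, and treating any 2xx code as healthy is an assumption.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is still available.
        return err.code


def is_healthy(status: int) -> bool:
    """Treat any 2xx response as healthy (an assumption for this sketch)."""
    return 200 <= status < 300
```

With these helpers, the 500 seen during the incident would be reported as unhealthy and the 200 seen after the restart as healthy. Connection-level failures such as a refused connection are not handled here and would surface as `urllib.error.URLError`.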
Corrective and Preventative Measures
After review and analysis, the following actions are required to avoid similar issues in the future:
- Implement a monitoring service
- Use a configuration management tool like Puppet to guarantee the state of changes applied to the server
- Improve the process for auditing all high-risk configuration options
- Develop a better mechanism for quickly delivering status notifications during incidents
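As an illustration of the monitoring item above, the core alerting logic could look something like the sketch below. It is not a description of any existing tool: the `FailureTracker` class and the threshold of three consecutive failed checks are assumptions. It only decides when to alert, so the periodic status checks and the notification delivery would need to be added around it.

```python
from typing import Optional


class FailureTracker:
    """Decide when to alert based on consecutive failed health checks."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # hypothetical default
        self.consecutive_failures = 0

    def record(self, status: Optional[int]) -> bool:
        """Record one check result; `None` means no response was received.

        Returns True only on the check where the failure count reaches the
        threshold, so a long outage triggers a single alert.
        """
        if status is not None and 200 <= status < 300:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.threshold
```

Requiring several consecutive failures before alerting is a common way to avoid notifications for a single transient error, at the cost of a slightly slower first alert.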
We must continually and quickly improve our technology and operational processes to prevent issues. We apologize to all users for any inconvenience this may have caused.