What decision should I make about an incident?

This is mainly written from a production engineer’s view. It might be helpful for an engineer who is responsible for some production service and just wants a sanity check. On with the question.

Sometimes you have legacy software that most of your company’s critical path relies on. That software hasn’t powered off in X amount of scary years. Some combination of documentation, task automation, and source-controlled code is nonexistent. One question to ask is: when was the last time this box (host machine) powered off? Do you know what it takes to restart the process(es)? Are you positive you can restart all the processes on that host if you powered it off and used the “restart it” trick? If the answer is no, you, my friend, have some pets (pet instances).

Is the recovery plan well documented? Actionable documentation is key. The documentation should contain what to do: the dependent services or jobs that might need a restart, kill, relaunch, or reschedule; the proper sequence of task execution; and the consumers and stakeholders of the system. I’m going overkill, but you get the point.

What data integrity and retention policies exist for this system? Sometimes there will be DB locks handling payment operations at the application layer, or faulty code that causes memory leaks or hardware failures. Is it okay to lose any of that data? I say yes, it is okay to lose data. If not, of course there are more expensive solutions to deploy. But you can’t think about that now.

If the incident is in a domain outside your comfort zone, then use the appropriate escalation path. Don’t just ask, “Is anyone getting a 400 error?” Describe the service, where it is failing, and any other key details.
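
To make the point about documenting the proper sequence of task execution a bit more concrete, here is a minimal sketch of how a restart order could be derived from a written-down dependency map. The service names are hypothetical, and the approach (a topological sort using Python’s standard `graphlib` module, available in Python 3.9+) is one reasonable option, not a prescribed method.

```python
from graphlib import TopologicalSorter

# Hypothetical services. Each key maps to the set of services it depends on,
# i.e. the services that must be running before it can be started.
DEPENDS_ON = {
    "database": set(),
    "message-queue": set(),
    "payments-api": {"database", "message-queue"},
    "batch-jobs": {"database"},
    "web-frontend": {"payments-api"},
}


def restart_order(depends_on):
    """Return the services ordered so that dependencies come first.

    Raises graphlib.CycleError if the map contains a circular dependency,
    which is worth finding out before an incident, not during one.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = restart_order(DEPENDS_ON)
for step, service in enumerate(order, start=1):
    print(f"{step}. restart {service}")
```

Even if nobody ever runs this, keeping the dependency map in source control means the restart sequence lives somewhere other than in one person’s head.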

How important are user experiences? How are users experiencing the current incident? Don’t just guess at it: engage with your users and get their input if possible. Services that your users interact with directly, or that support internal employees who help customers when they call in, are vital. Execute this well, and everyone will feel great about the decisions made at the resolution.

I suggest codifying as much as you can during the incident, even if nothing was codified before. I can see checking logs and restarting processes by hand. But if you are starting process(es) with different flag options, or installing dependencies, then you should code those steps and document them. That way you have a changelog for the postmortem. The best bet is to have the code reviewed during the incident by a team member. Reviewed code helps show consensus on the path to resolution. Also, if you have to step away from the incident or the incident occurs again, someone else can pick it up from your train of thought. Through this process you are improving your code base and removing technical debt during an incident.

Don’t forget to check in on the affected users. Inform them of any updates to the system and of any loss of data. Communication is key, so make sure you consistently provide updates. Users will always ask for an ETA. Sometimes you will be unsure, so give a time by which you will provide the next update. 30 minutes is usually fair unless you have strict SLAs.

After the incident is over, write a postmortem (Michael’s is the best IMO, and he’s a cool SRE) and publish it for the company. Don’t have a postmortem culture? Bring it. Send it out as an email to your organization as a technical thought. Set up a meeting with an optional invite so that people can join and learn more about your experiences and the systems that exist.
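
As a rough illustration of the advice above about codifying steps and committing to a next update time, here is a small sketch of an incident log that records each action, notes whether it was reviewed, and renders a changelog for the postmortem. The class, function names, and sample entries are all made up for the example; the 30-minute interval comes from the text and should be tightened if your SLAs demand it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Default gap between user updates, per the 30-minute suggestion in the text.
UPDATE_INTERVAL = timedelta(minutes=30)


@dataclass
class IncidentLog:
    """Hypothetical in-memory log of the steps taken during an incident."""
    title: str
    steps: list = field(default_factory=list)

    def record(self, action, reviewer=None, when=None):
        """Store one step; defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        self.steps.append((when, action, reviewer))

    def changelog(self):
        """Render the recorded steps as plain text for the postmortem."""
        lines = [f"Changelog: {self.title}"]
        for when, action, reviewer in self.steps:
            reviewed = f" (reviewed by {reviewer})" if reviewer else " (unreviewed)"
            lines.append(f"{when:%Y-%m-%d %H:%M} UTC  {action}{reviewed}")
        return "\n".join(lines)


def next_update(last_update, interval=UPDATE_INTERVAL):
    """Return the time by which users should hear from you again."""
    return last_update + interval


# Sample data with a fixed start time so the output is repeatable.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
log = IncidentLog("payments worker not processing jobs")
log.record("restarted worker with changed flags", reviewer="teammate", when=start)
log.record("installed missing dependency", when=start + timedelta(minutes=10))

print(log.changelog())
print("Next user update due:", next_update(start).strftime("%H:%M UTC"))
```

In practice the same idea works just as well as commits in a repository or entries in a shared doc; the point is that every step has a timestamp, a description, and a reviewer where possible.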

What decision should I make about an incident?

But guess what. Now you know more about how that shit works.