Skip to content

Conclusion

Complex systems have many factors which can go wrong. It can be a bad design & architecture, poorly managed code, poor policies around different caches, bad DB queries or architecture, improper use of resources, or bad OS version, poorly monitored system, datacenter issues, network faults, and many more, Any of these can go wrong.

As an SRE, Knowing important tools/commands, best practices, profiling, benchmarking and scaling can help you with faster troubleshooting and performance improvement of the overall system.

Further readings

Here are some links from the LinkedIn Engineering Blog, as written by LinkedIn engineers, about firefighting they did, ensuring site up 24x7x365.