System troubleshooting and performance improvements
Prerequisites
What to expect from this course
This brief course gives a general introduction to troubleshooting system issues, such as analysing API failures, resource utilization, network issues, and hardware and OS issues. It also briefly covers profiling and benchmarking as ways to measure overall system performance.
What is not covered under this course
This course does not cover the following:
- System Design and Architecture.
- Programming practices.
- Metrics and Monitoring.
- OS basics.
Course Contents
- Introduction
- Troubleshooting
- Important tools to know
- Performance improvements
- Troubleshooting Example
- Conclusion
Introduction
Troubleshooting is an important part of operations and development. It can’t be learned by reading a single article or completing an online course. It is a continuous learning process, and one learns it through:
- Daily operations and development.
- Finding and fixing application bugs.
- Finding and fixing system and network issues.
- Performance analysis and improvements.
- And more.
From an SRE’s perspective, a few things are expected to be known upfront in order to troubleshoot problems in single or distributed systems:
- Know your resources well and understand host specifications, like CPU, memory, network, disk, etc.
- Understand system design and architecture.
- Ensure important metrics are being collected/rendered properly.
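As a rough illustration of the first point, the sketch below collects a few basic host facts using only the Python standard library. The function name `host_summary` and the dictionary keys are hypothetical, chosen for this example; the memory lookup assumes a Linux-like system where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`, and falls back to `None` elsewhere.

```python
import os
import platform
import shutil
import socket


def host_summary(path="/"):
    """Return a small dict of host facts (illustrative helper, not a standard API)."""
    disk = shutil.disk_usage(path)  # total/used/free bytes for the filesystem holding `path`
    summary = {
        "hostname": socket.gethostname(),
        "os": platform.platform(),
        "cpu_count": os.cpu_count(),  # logical CPUs; may be None if undetermined
        "disk_total_gib": round(disk.total / 2**30, 1),
        "disk_free_gib": round(disk.free / 2**30, 1),
    }
    # Physical memory = page size * number of physical pages.
    # Assumption: these sysconf names are available (true on Linux, not everywhere).
    try:
        mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        summary["memory_total_gib"] = round(mem_bytes / 2**30, 1)
    except (AttributeError, ValueError, OSError):
        summary["memory_total_gib"] = None
    return summary


if __name__ == "__main__":
    for key, value in host_summary().items():
        print(f"{key}: {value}")
```

A snapshot like this is only a starting point. In practice the same facts usually come from existing inventory or monitoring systems, but having them at hand makes it easier to judge whether an observed load is reasonable for the host.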
There is a famous quote attributed to the HP founders: “What gets measured gets fixed.”
If system components and performance metrics are captured thoroughly, there is a high chance of troubleshooting an issue successfully and early.
Scope
There is no single approach to troubleshooting different types of applications or services, and a failure can occur at any layer. We will limit the scope of this course to web API services.
Note: The Linux ecosystem is wide. There are hundreds of tools and utilities that can help with system troubleshooting, each with its own benefits and functionality. We will cover some of the well-known ones, which either ship with Linux or are available as open source. A detailed explanation of the tools mentioned here is out of scope; please refer to the man pages or online documentation for more examples.