Conclusion

A robust monitoring and alerting system is necessary for maintaining and troubleshooting a service. A dashboard with key metrics gives you an overview of service performance in one place. Well-defined alerts (with realistic thresholds and notifications) let you quickly identify anomalies in the service infrastructure and in resource saturation. By acting on these alerts, you can prevent service degradation and reduce the mean time to detect (MTTD) service breakdowns.
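As a rough illustration of what a "realistic threshold" can mean in practice, the sketch below fires an alert only when a metric stays above its threshold for a sustained period, so that a single short spike does not page anyone. The names `Sample` and `first_alert_time`, the 90% threshold, and the 120-second window are all hypothetical choices for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    timestamp: float  # seconds since the start of the observation window
    value: float      # metric value, e.g. CPU utilization in percent


def first_alert_time(
    samples: List[Sample], threshold: float, sustain_seconds: float
) -> Optional[float]:
    """Return the timestamp at which an alert would first fire, or None.

    The alert fires once the metric has stayed strictly above `threshold`
    for at least `sustain_seconds` without dropping back below it.
    Samples are assumed to be sorted by timestamp.
    """
    breach_start: Optional[float] = None
    for sample in samples:
        if sample.value > threshold:
            if breach_start is None:
                breach_start = sample.timestamp  # a new breach begins here
            if sample.timestamp - breach_start >= sustain_seconds:
                return sample.timestamp
        else:
            breach_start = None  # metric recovered, so reset the window
    return None


# Hypothetical CPU readings taken once a minute.
cpu_values = [50, 95, 60, 92, 93, 94, 96]
cpu_samples = [Sample(i * 60.0, v) for i, v in enumerate(cpu_values)]

# The spike at t=60 is ignored; the breach that starts at t=180 is reported.
print(first_alert_time(cpu_samples, threshold=90, sustain_seconds=120))  # 300.0
```

The sustain window is a trade-off: a longer window means fewer false alarms, but it also adds directly to the time it takes to detect a real problem.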

In addition to in-house monitoring, monitoring real-user experience helps you understand service performance as perceived by users. Many components are involved in serving the user, and most of them are outside your control, so in-house metrics alone cannot show what users actually experience. Therefore, you need real-user monitoring in place.

Metrics give only a high-level, aggregated view of service performance. To understand the system better and recover faster during incidents, you might want to implement the other two pillars of observability: logs and traces. Log and trace data can help you understand what led to a service failure or degradation.

The following resources cover monitoring and observability in more depth:

References

Google SRE book: Monitoring Distributed Systems
Mastering Distributed Tracing, by Yuri Shkuro
Monitoring and Observability
Three Pillars with Zero Answers
Engineering blogs on LinkedIn, Grafana, Elastic.co, OpenTelemetry