We have been running Prometheus as our white-box monitoring solution for the last year and a half. The following points are worth noting:
1. For capacity planning, the number of samples should be tracked over time, and Prometheus should expose its own metrics. For reliability, Prometheus itself should be monitored across data centers and environments: the metrics of each Prometheus should be collected by an instance in a different data center or environment.
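
As a rough aid for the capacity point in item 1, the sketch below estimates local disk needs from the ingestion rate. It uses the rule of thumb from the Prometheus storage documentation (disk is roughly retention seconds × ingested samples per second × bytes per sample, at about 1 to 2 bytes per sample). The function name and the example numbers are illustrative assumptions, not measurements from our setup. The ingestion rate can be read from Prometheus's own metric `prometheus_tsdb_head_samples_appended_total`.

```python
def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB disk estimate: retention * ingestion rate * sample size."""
    if retention_days < 0 or samples_per_second < 0 or bytes_per_sample <= 0:
        raise ValueError("inputs must be non-negative (bytes_per_sample > 0)")
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical numbers: 15 days of retention at 100k samples per second.
    needed = estimate_disk_bytes(retention_days=15, samples_per_second=100_000)
    print(f"approx. {needed / 1e9:.1f} GB")
```

Re-running the estimate as the sample rate grows gives an early warning before the disk fills up.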
2. Recording rules must be added. The most CPU-expensive queries should be identified and optimized, and recording rules should be added for all of them.
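
To make item 2 concrete, here is a minimal sketch that renders a Prometheus rule-file group as YAML text. The helper name, the group name, and the example query are hypothetical; the `record`/`expr` layout and the `level:metric:operations` naming convention follow the Prometheus recording rules documentation.

```python
import json
from typing import Mapping


def render_recording_rules(group_name: str, rules: Mapping[str, str]) -> str:
    """Render one rule group as YAML text.

    `rules` maps the new series name (level:metric:operations) to its PromQL.
    """
    lines = ["groups:", f"  - name: {json.dumps(group_name)}", "    rules:"]
    for record, expr in rules.items():
        lines.append(f"      - record: {record}")
        # A JSON string literal is also a valid YAML double-quoted scalar.
        lines.append(f"        expr: {json.dumps(expr)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical expensive dashboard query, precomputed as a new series.
    print(render_recording_rules(
        "expensive_queries",
        {"job:http_requests:rate5m": "sum by (job) (rate(http_requests_total[5m]))"},
    ))
```

The generated file can be validated with `promtool check rules` before it is loaded, and dashboards can then query `job:http_requests:rate5m` instead of the raw expression.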
3. Plan for Prometheus high availability and redundancy:
Thanos (and similar tools) can handle Prometheus high availability very well, but there are still things to consider:
a. For the platform team, there should be a dedicated Prometheus setup for platform visibility.
b. For platform tenants (application users), there should be a separate setup.
c. Failure tests (killing the Prometheus pod, shutting down a server, rack failure, loss of network connectivity) should be performed.
d. Limits on the number of samples per Prometheus must be decided, based on its CPU, memory, and network capacity.
e. Labels should be used to differentiate apps for Prometheus service discovery.
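
Points a, b, and e can be combined: a label decides which Prometheus setup scrapes which app. The sketch below mimics, in simplified form, what a `keep` relabel rule does in a scrape configuration. The label name `monitoring_scope` and the target list are hypothetical, and exact string equality stands in for the regex match that Prometheus relabeling actually performs.

```python
from typing import Dict, Iterable, List

Target = Dict[str, str]  # label name -> label value


def keep_targets(targets: Iterable[Target], label: str, value: str) -> List[Target]:
    """Retain only targets whose `label` equals `value` (simplified 'keep')."""
    return [t for t in targets if t.get(label) == value]


if __name__ == "__main__":
    # Hypothetical discovered targets with an assumed 'monitoring_scope' label.
    discovered = [
        {"app": "etcd", "monitoring_scope": "platform"},
        {"app": "shop", "monitoring_scope": "tenant"},
        {"app": "legacy"},  # unlabeled: scraped by neither setup
    ]
    print(keep_targets(discovered, "monitoring_scope", "platform"))
    print(keep_targets(discovered, "monitoring_scope", "tenant"))
```

Unlabeled apps fall through both filters in this sketch, so it is worth deciding up front whether they should default to one of the setups.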
4. Prometheus settings must be tuned, for example the Alertmanager cluster, file descriptors for Prometheus, and limits on the number of series a query can return. (Tenants should not be able to query a huge chunk of data that drives CPU usage to 100%. Queries should be restricted to the last N days.)
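
One way to enforce the "last N days" restriction from item 4 is in a layer in front of Prometheus that clamps the requested time range. The following is a minimal sketch under that assumption; the function name and the idea of a query gateway are illustrative and not part of Prometheus itself. Prometheus also offers server-side guards such as the `--query.max-samples` and `--query.timeout` flags, which complement this kind of check.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def clamp_query_range(start: datetime, end: datetime, max_days: int,
                      now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Clamp [start, end] so it never reaches further back than max_days.

    All datetimes are expected to be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    if end < start:
        raise ValueError("end must not be before start")
    earliest = now - timedelta(days=max_days)
    if end < earliest:
        raise ValueError(f"range lies entirely outside the last {max_days} days")
    # Move the start forward if the tenant asked for too much history.
    return max(start, earliest), end


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2024, 1, 31, tzinfo=utc)
    # A tenant asks for two months; only the last 7 days are allowed.
    print(clamp_query_range(datetime(2023, 12, 1, tzinfo=utc),
                            datetime(2024, 1, 30, tzinfo=utc), 7, now=now))
```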
5. Failover mechanism: if the active Prometheus server loses data, we should fail over to a backup Prometheus. Users should be notified about any false alerts.
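
A minimal sketch of the failover selection in item 5, assuming a priority-ordered list of Prometheus base URLs (the URLs and function names are hypothetical). It probes the `/-/healthy` management endpoint, which only reports that the process is up. Detecting lost data would need an additional check that is not shown here.

```python
from typing import Callable, Optional, Sequence
from urllib.error import URLError
from urllib.request import urlopen


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Probe Prometheus's /-/healthy endpoint; any error counts as unhealthy."""
    try:
        with urlopen(f"{base_url.rstrip('/')}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError, ValueError):
        return False


def pick_prometheus(candidates: Sequence[str],
                    probe: Callable[[str], bool] = is_healthy) -> Optional[str]:
    """Return the first healthy server in priority order, or None."""
    for url in candidates:
        if probe(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical addresses: active server first, backup second.
    servers = ["http://prometheus-active:9090", "http://prometheus-backup:9090"]
    print(pick_prometheus(servers))
```

Passing the probe as a parameter keeps the selection logic testable without a live server.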
6. Logs must be stored somewhere:
i. To check Grafana logs (queries that are expensive in terms of CPU) (Nginx? Istio?)
ii. To check Prometheus errors
7. SLAs must be decided for Prometheus, and the platform team should be alerted when they are breached.
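
To illustrate item 7, an availability target can be turned into a downtime budget that an alert compares against. This is a sketch only: the 99.9% target, the 30-day window, and the function names are assumptions for the example, not values we have agreed on.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime budget in minutes for an availability target such as 0.999."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the interval (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return (1.0 - target) * window_days * 24 * 60


def sla_breached(observed_downtime_minutes: float, target: float,
                 window_days: int = 30) -> bool:
    """True when observed downtime exceeds the budget for the window."""
    return observed_downtime_minutes > allowed_downtime_minutes(target, window_days)


if __name__ == "__main__":
    # Hypothetical target: 99.9% availability over 30 days.
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes of downtime allowed")
    print(sla_breached(observed_downtime_minutes=60, target=0.999))
```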