Monitoring, Logging, and Observability
Observability Fundamentals
Definition
Observability is the ability to understand a system's internal state from its external outputs.
3 Pillars of Observability:
1. Metrics → What is happening? (Numbers, trends)
2. Logs → Why is it happening? (Events, errors)
3. Traces → Where is it happening? (Request flow)
Monitoring vs Observability
| Monitoring | Observability |
|---|---|
| Predefined metrics | Unknown unknowns |
| "Is it working?" | "Why is it not working?" |
| Reactive | Proactive |
| Dashboards, alerts | Correlation, exploration |
Kubernetes Observability Challenges
- Ephemeral Nature: Pods come and go constantly
- Distributed Systems: microservices, many moving components
- Dynamic Scaling: resources change all the time
- Multi-tenancy: multiple teams and namespaces share the cluster
Golden Signals
Google SRE's four key metrics (see the PromQL sketch below):
- Latency: How long does a request take?
- Traffic: How many requests are coming in?
- Errors: How many requests are failing?
- Saturation: How full are the resources?
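As a rough sketch, the four signals can be expressed in PromQL. This assumes the application exposes the http_requests_total and http_request_duration_seconds metrics used later in this chapter, and that node_exporter is installed for the saturation example:
# Latency: 95th-percentile request duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Traffic: requests per second
sum(rate(http_requests_total[5m]))
# Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation: memory utilisation per node
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)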
Metrics
Definition
Metrics are numeric measurements collected over time (time-series data).
Metric Types:
- Counter: only ever increases (requests, errors)
- Gauge: can go up or down (CPU, memory, connections)
- Histogram: Distribution (request duration)
- Summary: Quantiles (p50, p95, p99)
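For intuition, this is roughly how the first three types look in Prometheus's text exposition format (metric names here are illustrative):
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
# TYPE process_open_fds gauge
process_open_fds 42
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 930
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 53.4
http_request_duration_seconds_count 1027
A Summary similarly exposes quantile-labelled series (e.g. {quantile="0.95"}) plus _sum and _count.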
Kubernetes Metrics
1. Node Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- Filesystem usage
2. Pod Metrics
- CPU usage
- Memory usage
- Network I/O
- Container restarts
3. Container Metrics
- CPU usage
- Memory usage
- Disk usage
4. Cluster Metrics
- Total nodes
- Total pods
- API server requests
- etcd performance
Metrics Server
Definition: a lightweight aggregator of resource metrics that runs inside the Kubernetes cluster.
Installation:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Resource Metrics API:
# Node metrics
kubectl top nodes
# Pod metrics
kubectl top pods
# Specific namespace
kubectl top pods -n kube-system
# Container metrics
kubectl top pods --containers
Output:
NAME     CPU(cores)   MEMORY(bytes)
node-1   250m         1024Mi
node-2   180m         890Mi

NAME                   CPU(cores)   MEMORY(bytes)
nginx-deployment-abc   10m          50Mi
redis-master           5m           30Mi
Limitations:
- Only current values (no history)
- Limited metric set (CPU and memory only)
- Not for alerting
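Metrics Server serves the Resource Metrics API, which can also be queried directly; a quick sketch (piping through jq is optional and assumes it is installed):
# Raw Resource Metrics API (the same data kubectl top uses)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods | jq .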
Prometheus
Definition: an open-source monitoring and alerting toolkit.
Architecture:
Prometheus Server
↓ (scrape)
Target Endpoints (exporters)
↓
Time-Series Database
↓
Query (PromQL)
↓
Grafana / Alertmanager
Components:
- Prometheus Server: metrics scraping and storage
- Exporters: expose metrics from applications and systems
- Alertmanager: alert routing and management
- Pushgateway: for short-lived jobs
Prometheus Installation
Helm:
# Add repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
Verify:
kubectl get pods -n monitoring
kubectl get svc -n monitoring
ServiceMonitor
Definition: tells Prometheus which Services (and which ports/paths) to scrape.
YAML:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Application Service:
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
  selector:
    app: myapp
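Two common gotchas, stated here as assumptions about default chart values: with kube-prometheus-stack, Prometheus only picks up ServiceMonitors whose labels match its serviceMonitorSelector (by default the Helm release label, e.g. release: prometheus), and a ServiceMonitor only selects Services in its own namespace unless spec.namespaceSelector is set. A rough way to check which labels are expected:
# Inspect which ServiceMonitor labels the Prometheus instance selects
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.serviceMonitorSelector}'
# A common fix is adding that label to the ServiceMonitor, e.g. metadata.labels.release: prometheus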
PromQL (Prometheus Query Language)
Basic Queries:
Instant Vector:
# CPU usage
node_cpu_seconds_total
# Memory usage
node_memory_MemAvailable_bytes
# HTTP requests
http_requests_total
Filtering:
# By label
http_requests_total{job="api-server"}
# Multiple labels
http_requests_total{job="api-server", method="GET"}
# Regex
http_requests_total{status=~"5.."}
Aggregation:
# Sum
sum(http_requests_total)
# Average
avg(node_cpu_seconds_total)
# By label
sum(http_requests_total) by (job)
Rate:
# Requests per second
rate(http_requests_total[5m])
# Increase
increase(http_requests_total[1h])
Functions:
# Percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Round
round(node_memory_MemAvailable_bytes / 1024 / 1024)
Complex Queries:
# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Request error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
Application Metrics
Instrumentation:
Go (Prometheus client):
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Observe request duration and count the request
    timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
    defer timer.ObserveDuration()
    // Handle request
    httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/api", handler)
    http.ListenAndServe(":8080", nil)
}
Python (Prometheus client):
from prometheus_client import Counter, Histogram, start_http_server
import time

request_count = Counter('http_requests_total', 'Total requests',
                        ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds',
                             'Request duration',
                             ['method', 'endpoint'])

@request_duration.labels(method='GET', endpoint='/api').time()
def handle_request():
    # Process request
    time.sleep(0.1)
    request_count.labels(method='GET', endpoint='/api', status='200').inc()

if __name__ == '__main__':
    start_http_server(8080)
    # start_http_server runs in a daemon thread; keep the process alive
    while True:
        handle_request()
Custom Metrics
Annotation-based scraping (Service):
apiVersion: v1
kind: Service
metadata:
  name: myapp
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  ports:
    - name: metrics
      port: 8080
  selector:
    app: myapp
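These prometheus.io/* annotations are only honored if the Prometheus scrape configuration contains matching relabeling rules (the Prometheus Operator does not act on them by default). A minimal sketch of such a scrape config, along the lines of the classic prometheus Helm chart defaults:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    # Only scrape services annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Use the annotated path instead of /metrics when present
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Rewrite the target address to the annotated port
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__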
Logging
Definition
Logging is the recording of system events.
Log Types:
- Application Logs: emitted by the application itself
- System Logs: OS, kernel
- Audit Logs: Security events
Kubernetes Logging
1. Container Logs
Stdout/Stderr:
The application inside the container writes to stdout/stderr:
# Simple logging
echo "Hello World"
# Structured logging (JSON)
echo '{"level":"info","message":"Hello World","timestamp":"2024-01-01T00:00:00Z"}'
View logs:
# Current logs
kubectl logs <pod-name>
# Previous container
kubectl logs <pod-name> --previous
# Specific container
kubectl logs <pod-name> -c <container-name>
# Follow logs
kubectl logs <pod-name> -f
# Since
kubectl logs <pod-name> --since=1h
# Tail
kubectl logs <pod-name> --tail=100
2. Node Logs
The container runtime on the node (Docker, containerd) writes container logs to files:
Docker:
/var/lib/docker/containers/<container-id>/<container-id>-json.log
Containerd:
/var/log/pods/<namespace>_<pod-name>_<pod-id>/<container-name>/<restart-count>.log
3. Cluster Component Logs
Control Plane:
- kube-apiserver
- kube-scheduler
- kube-controller-manager
- etcd
Worker Node:
- kubelet
- kube-proxy
Location:
- systemd units: journalctl -u kubelet
- Static pods: /var/log/pods/
Logging Patterns
1. Node-level Logging
Advantages:
- Simple
- No additional agents
Disadvantages:
- Logs are lost when the node goes away
- Retention limited
- No centralization
2. Sidecar Container
Streaming Sidecar:
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - name: logs
          mountPath: /var/log
    - name: log-shipper
      image: busybox
      command: ['sh', '-c', 'tail -f /var/log/app.log']
      volumeMounts:
        - name: logs
          mountPath: /var/log
  volumes:
    - name: logs
      emptyDir: {}
Advantages:
- Separation of concerns
- Multiple log streams
Disadvantages:
- Resource overhead
- Complexity
3. Node-level Logging Agent (DaemonSet)
Architecture:
Application → stdout/stderr
↓
Node log file
↓
Logging Agent (DaemonSet)
↓
Central Log Storage
Most common pattern!
EFK Stack (Elasticsearch, Fluentd, Kibana)
Components:
- Fluentd (or Fluent Bit): Log collection
- Elasticsearch: log storage and search
- Kibana: Visualization
Fluentd DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
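The serviceAccountName above assumes a ServiceAccount with read access to pod metadata exists so Fluentd can enrich log records; a minimal RBAC sketch:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: kube-system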
Elasticsearch
# Helm install
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch \
--namespace logging \
--create-namespace \
--set replicas=3
Kibana
helm install kibana elastic/kibana \
--namespace logging \
--set elasticsearchHosts=http://elasticsearch-master:9200
Access Kibana:
kubectl port-forward -n logging svc/kibana-kibana 5601:5601
# Open http://localhost:5601
Loki Stack (Grafana Loki)
Alternative to EFK: Lightweight, cost-effective.
Components:
- Promtail: Log collection
- Loki: Log storage
- Grafana: Visualization
Installation:
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace logging \
--create-namespace \
--set grafana.enabled=true \
--set prometheus.enabled=true \
--set promtail.enabled=true
Advantages over EFK:
- Cheaper (indexes only labels, not the full log text)
- Simpler
- Grafana integration
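Logs in Loki are queried with LogQL, whose selector syntax mirrors PromQL labels; a few illustrative queries (the label names here are assumptions):
# All error lines from one app in one namespace
{namespace="production", app="myapp"} |= "error"
# Parse JSON logs and filter on a field
{app="myapp"} | json | level="error"
# Error log rate over 5 minutes (usable in Grafana panels and alerts)
sum(rate({app="myapp"} |= "error" [5m]))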
Structured Logging
JSON Logging:
{
"timestamp": "2024-01-01T12:00:00Z",
"level": "info",
"message": "User logged in",
"user_id": "12345",
"ip": "192.168.1.1",
"trace_id": "abc123"
}
Advantages:
- Easy parsing
- Rich metadata
- Correlation (trace_id)
Libraries:
- Go: logrus, zap
- Python: structlog
- Java: logback
- Node.js: winston
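A minimal sketch with Go's zap, producing JSON output similar to the example above:
package main

import "go.uber.org/zap"

func main() {
    // NewProduction writes structured JSON log lines to stderr
    logger, err := zap.NewProduction()
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    logger.Info("User logged in",
        zap.String("user_id", "12345"),
        zap.String("trace_id", "abc123"),
    )
}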
Log Retention
Considerations:
- Storage Cost: log volumes grow very quickly
- Compliance: Regulatory requirements
- Performance: Index size
Strategy:
Hot tier (0-7 days): Fast SSD, full indexing
Warm tier (7-30 days): Slower disk, reduced replicas
Cold tier (30-90 days): Archive, searchable backup
Delete (>90 days): once the compliance period has expired
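With Elasticsearch, tiers like these are usually implemented as an ILM (Index Lifecycle Management) policy; a rough sketch, with thresholds that are purely illustrative:
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "allocate": { "number_of_replicas": 1 } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}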
Tracing
Definition
Tracing is following a request's complete path through a distributed system.
Use Case: in a microservice architecture, a single request passes through many services:
Client → Gateway → Auth → UserService → Database
                              ↓
                        EmailService
Each hop adds latency. Where is the bottleneck?
OpenTelemetry
Definition: a vendor-neutral observability framework.
Components:
- SDK: Application instrumentation
- API: Standard interface
- Collector: Telemetry data collection
- Exporters: export telemetry to backends
Supported Backends:
- Jaeger
- Zipkin
- Tempo (Grafana)
- DataDog
- New Relic
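A minimal Collector configuration sketch: it receives OTLP from applications and forwards traces to an OTLP-capable backend such as Tempo (the endpoint address is an assumption):
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}

exporters:
  otlp:
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]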
Jaeger
Architecture:
Application (instrumented)
↓ (trace data)
Jaeger Agent
↓
Jaeger Collector
↓
Cassandra/Elasticsearch
↓
Jaeger Query
↓
Jaeger UI
Installation:
kubectl create namespace observability
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/crds/jaegertracing.io_jaegers_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/service_account.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role_binding.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/operator.yaml
Jaeger Instance:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
  namespace: observability
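Once the operator has reconciled the instance, the UI is typically exposed through a *-query Service on port 16686 (the exact Service name may vary):
kubectl get pods -n observability
kubectl port-forward -n observability svc/simplest-query 16686:16686
# Open http://localhost:16686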
Application Instrumentation
OpenTelemetry SDK:
Go:
import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() func() {
    exporter, _ := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
    ))
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("myapp"),
        )),
    )
    otel.SetTracerProvider(tp)
    return func() { tp.Shutdown(context.Background()) }
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    tracer := otel.Tracer("myapp")
    ctx, span := tracer.Start(r.Context(), "handleRequest")
    defer span.End()
    // Call other services with ctx
    result := callUserService(ctx)
    span.SetAttributes(attribute.String("user.id", result.UserID))
}
Trace Context Propagation
HTTP Headers:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
Automatic propagation:
- OpenTelemetry auto-instrumentation
- Service mesh (Istio, Linkerd)
Monitoring Stack
Complete Observability Stack
Recommended Stack:
┌─────────────────────────────────────────┐
│ Grafana (Visualization) │
├─────────────────────────────────────────┤
│ Prometheus │ Loki │ Tempo/Jaeger │
│ (Metrics) │ (Logs) │ (Traces) │
├─────────────────────────────────────────┤
│ Application (Instrumented) │
└─────────────────────────────────────────┘
Installation (Helm):
# Kube-prometheus-stack (Prometheus + Grafana)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Loki
helm install loki grafana/loki-stack \
--namespace monitoring \
--set grafana.enabled=false \
--set prometheus.enabled=false
# Tempo (Grafana Tempo)
helm install tempo grafana/tempo \
--namespace monitoring
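Grafana can then be pointed at all three backends as data sources; a provisioning sketch in which the service names, ports, and URLs are assumptions based on the release names above and may differ per chart version:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo.monitoring.svc:3200   # older chart versions use 3100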
Alerting
Prometheus Alerting
PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: app
      interval: 30s
      rules:
        - alert: HighCPUUsage
          expr: |
            100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage detected"
            description: "CPU usage is above 80% for 5 minutes"
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod has restarted {{ $value }} times in 15 minutes"
        - alert: HighErrorRate
          expr: |
            (sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }}%"
Alertmanager
Configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty'
        - match:
            severity: warning
          receiver: 'slack'
    receivers:
      - name: 'default'
        email_configs:
          - to: 'team@example.com'
            from: 'alertmanager@example.com'
            smarthost: 'smtp.example.com:587'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/XXX'
            channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: 'YOUR_SERVICE_KEY'
Dashboarding
Grafana Dashboards
Import Dashboard:
# Port forward
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Default credentials
# Username: admin
# Password: prom-operator
Popular Dashboards:
- Kubernetes Cluster Monitoring (315)
- Node Exporter Full (1860)
- Kubernetes Pods (6417)
Custom Dashboard (JSON):
{
  "dashboard": {
    "title": "My App Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ]
      }
    ]
  }
}
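With kube-prometheus-stack, dashboards can also be provisioned declaratively: the Grafana sidecar loads any ConfigMap carrying the grafana_dashboard label (the label name and value depend on chart values, so treat this as a sketch):
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  myapp-dashboard.json: |
    { "title": "My App Dashboard", "panels": [] }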
Summary
Kubernetes observability rests on the three pillars, backed by alerting and visualization:
✅ Metrics: Prometheus, Metrics Server
✅ Logging: EFK / Loki stack
✅ Tracing: Jaeger, Tempo
✅ Alerting: Alertmanager
✅ Visualization: Grafana