
Monitoring, Logging, and Observability

Contents

  1. Observability Fundamentals
  2. Metrics
  3. Logging
  4. Tracing
  5. Monitoring Stack
  6. Alerting
  7. Dashboarding

Observability Fundamentals

Definition

Observability is the ability to understand a system's internal state from its external outputs.

3 Pillars of Observability:

1. Metrics → What is happening?     (Numbers, trends)
2. Logs    → Why is it happening?   (Events, errors)
3. Traces  → Where is it happening? (Request flow)

Monitoring vs Observability

Monitoring                  | Observability
--------------------------- | ---------------------------
Predefined metrics          | Unknown unknowns
"Is it working?"            | "Why is it not working?"
Reactive                    | Proactive
Dashboards, alerts          | Correlation, exploration

Kubernetes Observability Challenges

  1. Ephemeral nature: Pods come and go
  2. Distributed systems: microservices, many components
  3. Dynamic scaling: resources change constantly
  4. Multi-tenancy: multiple teams and namespaces

Golden Signals

Google SRE's four key signals (see the PromQL sketch after this list):

  1. Latency: How long does a request take?
  2. Traffic: How many requests are coming in?
  3. Errors: How many requests are failing?
  4. Saturation: How full are the resources?
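
A minimal PromQL sketch of the four signals, assuming an HTTP service that exposes the http_requests_total and http_request_duration_seconds metrics used later in this chapter (node_* metrics come from node-exporter):

# Latency: 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: CPU busy percentage per node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)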

Metrics

Definition

Metrics are numerical measurements (time-series data).

Metric Types (an exposition-format sample follows this list):

  1. Counter: only increases (requests, errors)
  2. Gauge: can go up or down (CPU, memory, connections)
  3. Histogram: distribution across buckets (request duration)
  4. Summary: client-side quantiles (p50, p95, p99)
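
For reference, this is roughly what these types look like in the Prometheus text exposition format served on a /metrics endpoint (metric names here are illustrative):

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027

# TYPE process_open_fds gauge
process_open_fds 42

# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 912
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 53.4
http_request_duration_seconds_count 1027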

Kubernetes Metrics

1. Node Metrics

  • CPU usage
  • Memory usage
  • Disk I/O
  • Network I/O
  • Filesystem usage

2. Pod Metrics

  • CPU usage
  • Memory usage
  • Network I/O
  • Container restarts

3. Container Metrics

  • CPU usage
  • Memory usage
  • Disk usage

4. Cluster Metrics

  • Total nodes
  • Total pods
  • API server requests
  • etcd performance

Metrics Server

Definition: a lightweight metrics aggregator that runs inside the Kubernetes cluster.

Installation:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Resource Metrics API:

# Node metrics
kubectl top nodes

# Pod metrics
kubectl top pods

# Specific namespace
kubectl top pods -n kube-system

# Container metrics
kubectl top pods --containers

Output:

NAME                   CPU(cores)   MEMORY(bytes)
node-1                 250m         1024Mi
node-2                 180m         890Mi

NAME                   CPU(cores)   MEMORY(bytes)
nginx-deployment-abc   10m          50Mi
redis-master           5m           30Mi

Limitations:

  • Only current values (no history)
  • Limited set of metrics (CPU and memory)
  • Not meant for alerting
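
Under the hood, kubectl top reads the Resource Metrics API exposed by Metrics Server; a quick sketch of querying it directly (assuming Metrics Server is installed and jq is available):

# Raw node metrics
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq

# Raw pod metrics in a namespace
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods" | jq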

Prometheus

Definition: an open-source monitoring and alerting toolkit.

Architecture:

Prometheus Server
      ↓ (scrape)
Target Endpoints (exporters)
      ↓
Time-Series Database (TSDB)
      ↓
Query (PromQL)
      ↓
Grafana / Alertmanager

Components:

  1. Prometheus Server: metrics scraping and storage
  2. Exporters: expose metrics for scraping
  3. Alertmanager: alert routing and notification
  4. Pushgateway: for short-lived jobs

Prometheus Installation

Helm:

# Add repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

Verify:

kubectl get pods -n monitoring
kubectl get svc -n monitoring

ServiceMonitor

Definition: tells Prometheus which Services to scrape.

YAML:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Application Service:

apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
  selector:
    app: myapp
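
To verify that Prometheus picked up the target, port-forward the Prometheus service and open the Targets page (the service name below assumes kube-prometheus-stack defaults and may differ for your release). Note that a ServiceMonitor only selects Services in its own namespace unless spec.namespaceSelector is set, and the chart only loads ServiceMonitors whose labels match its serviceMonitorSelector.

kubectl get servicemonitor -n monitoring
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets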

PromQL (Prometheus Query Language)

Basic Queries:

Instant Vector:

# CPU usage
node_cpu_seconds_total

# Memory usage
node_memory_MemAvailable_bytes

# HTTP requests
http_requests_total

Filtering:

# By label
http_requests_total{job="api-server"}

# Multiple labels
http_requests_total{job="api-server", method="GET"}

# Regex
http_requests_total{status=~"5.."}

Aggregation:

# Sum
sum(http_requests_total)

# Average
avg(node_cpu_seconds_total)

# By label
sum(http_requests_total) by (job)

Rate:

# Requests per second
rate(http_requests_total[5m])

# Increase
increase(http_requests_total[1h])

Functions:

# Percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Round
round(node_memory_MemAvailable_bytes / 1024 / 1024)

Complex Queries:

# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Request error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

Application Metrics

Instrumentation:

Go (Prometheus client):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Observe request duration for this method/endpoint pair
    timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
    defer timer.ObserveDuration()

    // Handle request

    httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}

func main() {
    // Expose /metrics for Prometheus to scrape
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/api", handler)
    http.ListenAndServe(":8080", nil)
}

Python (Prometheus client):

from prometheus_client import Counter, Histogram, start_http_server
import time

request_count = Counter('http_requests_total', 'Total requests',
                        ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds',
                             'Request duration',
                             ['method', 'endpoint'])

@request_duration.labels(method='GET', endpoint='/api').time()
def handle_request():
    # Process request
    time.sleep(0.1)
    request_count.labels(method='GET', endpoint='/api', status='200').inc()

if __name__ == '__main__':
    start_http_server(8080)
    # Application logic

Custom Metrics

Service with Prometheus annotations:

apiVersion: v1
kind: Service
metadata:
  name: myapp
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  ports:
  - name: metrics
    port: 8080
  selector:
    app: myapp
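
These annotations are not interpreted by Prometheus automatically; they only take effect when the scrape configuration honors them, as in the classic kubernetes_sd_configs relabeling sketch below (kube-prometheus-stack relies on ServiceMonitors instead by default):

scrape_configs:
- job_name: kubernetes-service-endpoints
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only targets whose Service has prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # Take the metrics path from prometheus.io/path
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # Take the scrape port from prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__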

Logging

Definition

Logging is the recording of system events.

Log Types:

  1. Application logs: emitted by the application
  2. System logs: OS and kernel
  3. Audit logs: security events

Kubernetes Logging

1. Container Logs

Stdout/Stderr:

The application inside the container writes to stdout/stderr:

# Simple logging
echo "Hello World"

# Structured logging (JSON)
echo '{"level":"info","message":"Hello World","timestamp":"2024-01-01T00:00:00Z"}'

View logs:

# Current logs
kubectl logs <pod-name>

# Previous container
kubectl logs <pod-name> --previous

# Specific container
kubectl logs <pod-name> -c <container-name>

# Follow logs
kubectl logs <pod-name> -f

# Since
kubectl logs <pod-name> --since=1h

# Tail
kubectl logs <pod-name> --tail=100

2. Node Logs

The node's container runtime (Docker, containerd) writes container logs to files:

Docker:

/var/lib/docker/containers/<container-id>/<container-id>-json.log

Containerd:

/var/log/pods/<namespace>_<pod-name>_<pod-id>/<container-name>/<restart-count>.log

3. Cluster Component Logs

Control Plane:

  • kube-apiserver
  • kube-scheduler
  • kube-controller-manager
  • etcd

Worker Node:

  • kubelet
  • kube-proxy

Location:

  • Systemd: journalctl -u kubelet
  • Static pods: /var/log/pods/

Logging Patterns

1. Node-level Logging

Advantages:

  • Simple
  • No additional agents

Disadvantages:

  • Logs are lost together with the node
  • Limited retention
  • No centralization

2. Sidecar Container

Streaming Sidecar:

apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
  - name: app
    image: myapp:1.0
    volumeMounts:
    - name: logs
      mountPath: /var/log
  - name: log-shipper
    image: busybox
    command: ['sh', '-c', 'tail -f /var/log/app.log']
    volumeMounts:
    - name: logs
      mountPath: /var/log
  volumes:
  - name: logs
    emptyDir: {}

Advantages:

  • Separation of concerns
  • Multiple log streams

Disadvantages:

  • Resource overhead
  • Complexity

3. Node-level Logging Agent (DaemonSet)

Architecture:

Application → stdout/stderr
        ↓
Node log file
        ↓
Logging Agent (DaemonSet)
        ↓
Central Log Storage

The most common pattern!

EFK Stack (Elasticsearch, Fluentd, Kibana)

Components:

  1. Fluentd (or Fluent Bit): Log collection
  2. Elasticsearch: log storage and search
  3. Kibana: Visualization

Fluentd DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Elasticsearch

# Helm install
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace \
  --set replicas=3

Kibana

helm install kibana elastic/kibana \
  --namespace logging \
  --set elasticsearchHosts=http://elasticsearch-master:9200

Access Kibana:

kubectl port-forward -n logging svc/kibana-kibana 5601:5601
# Open http://localhost:5601

Loki Stack (Grafana Loki)

Alternative to EFK: Lightweight, cost-effective.

Components:

  1. Promtail: Log collection
  2. Loki: Log storage
  3. Grafana: Visualization

Installation:

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace \
  --set grafana.enabled=true \
  --set prometheus.enabled=true \
  --set promtail.enabled=true

Advantages over EFK:

  • Cheaper (only labels are indexed, not the full log text)
  • Simpler to operate
  • Native Grafana integration (queried with LogQL, see the sketch below)
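
A short LogQL sketch of what label-based querying looks like (label and app names are illustrative):

# All logs from one app in one namespace
{namespace="production", app="myapp"}

# Only lines containing "error", parsed as JSON, filtered by level
{namespace="production", app="myapp"} |= "error" | json | level="error"

# Error rate over 5 minutes, usable in Grafana panels
sum(rate({app="myapp"} |= "error" [5m]))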

Structured Logging

JSON Logging:

{
  "timestamp": "2024-01-01T12:00:00Z",
  "level": "info",
  "message": "User logged in",
  "user_id": "12345",
  "ip": "192.168.1.1",
  "trace_id": "abc123"
}

Advantages:

  • Easy parsing
  • Rich metadata
  • Correlation (trace_id)

Libraries:

  • Go: logrus, zap (see the sketch after this list)
  • Python: structlog
  • Java: logback
  • Node.js: winston
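
A minimal zap sketch in Go, producing JSON logs with the fields shown above (field values are illustrative):

package main

import "go.uber.org/zap"

func main() {
    // NewProduction returns a JSON logger writing to stderr
    logger, _ := zap.NewProduction()
    defer logger.Sync()

    logger.Info("User logged in",
        zap.String("user_id", "12345"),
        zap.String("ip", "192.168.1.1"),
        zap.String("trace_id", "abc123"),
    )
}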

Log Retention

Considerations:

  1. Storage cost: log volume grows very quickly
  2. Compliance: Regulatory requirements
  3. Performance: Index size

Strategy:

Hot tier (0-7 days):   Fast SSD, full indexing
Warm tier (7-30 days): Slower disk, reduced replicas
Cold tier (30-90 days): Archive, searchable backup
Delete (>90 days):     Once the retention/compliance period ends

Tracing

Definition

Tracing is following a request's complete path through a distributed system.

Use case: in a microservice architecture, a single request passes through many services:

Client → Gateway → Auth → UserService → Database
                              ↓
                        EmailService

Each hop adds latency. Where is the bottleneck?

OpenTelemetry

Definition: a vendor-neutral observability framework.

Components:

  1. SDK: Application instrumentation
  2. API: Standard interface
  3. Collector: Telemetry data collection
  4. Exporters: export telemetry to backends

Supported Backends:

  • Jaeger
  • Zipkin
  • Tempo (Grafana)
  • DataDog
  • New Relic

Jaeger

Architecture:

Application (instrumented)
        ↓ (trace data)
Jaeger Agent
        ↓
Jaeger Collector
        ↓
Cassandra / Elasticsearch
        ↓
Jaeger Query
        ↓
Jaeger UI

Installation:

kubectl create namespace observability
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/crds/jaegertracing.io_jaegers_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/service_account.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role_binding.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/operator.yaml

Jaeger Instance:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
  namespace: observability

Application Instrumentation

OpenTelemetry SDK:

Go:

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0" // semconv version may differ in your SDK
)

func initTracer() func() {
    // Export spans to the Jaeger collector
    exporter, _ := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
    ))

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("myapp"),
        )),
    )

    otel.SetTracerProvider(tp)
    return func() { tp.Shutdown(context.Background()) }
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    tracer := otel.Tracer("myapp")
    ctx, span := tracer.Start(r.Context(), "handleRequest")
    defer span.End()

    // Call other services with ctx
    result := callUserService(ctx)

    span.SetAttributes(attribute.String("user.id", result.UserID))
}

Trace Context Propagation

HTTP Headers:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE

Automatic propagation (a manual injection sketch follows this list):

  • OpenTelemetry auto-instrumentation
  • Service mesh (Istio, Linkerd)
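
For reference, this is roughly what propagation does under the hood with the OpenTelemetry Go SDK; a minimal sketch, reusing the tracer setup from the previous section:

package client

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func callDownstream(ctx context.Context, url string) (*http.Response, error) {
    // Normally set once at startup: use W3C Trace Context (traceparent/tracestate)
    otel.SetTextMapPropagator(propagation.TraceContext{})

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }

    // Inject the current span context into the outgoing request headers
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

    return http.DefaultClient.Do(req)
}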

Monitoring Stack

Complete Observability Stack

Recommended Stack:

┌──────────────────────────────────────────┐
│         Grafana (Visualization)          │
├──────────────┬───────────┬───────────────┤
│  Prometheus  │   Loki    │ Tempo/Jaeger  │
│  (Metrics)   │  (Logs)   │   (Traces)    │
├──────────────┴───────────┴───────────────┤
│        Application (Instrumented)        │
└──────────────────────────────────────────┘

Installation (Helm):

# Kube-prometheus-stack (Prometheus + Grafana)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Loki
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set prometheus.enabled=false

# Tempo (Grafana Tempo)
helm install tempo grafana/tempo \
  --namespace monitoring
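
To make the separately installed Loki and Tempo show up in the bundled Grafana, additional data sources can be declared in the chart values; a sketch, where the service URLs and ports are assumptions that depend on your releases:

# values.yaml passed to kube-prometheus-stack
grafana:
  additionalDataSources:
  - name: Loki
    type: loki
    url: http://loki:3100    # adjust to your Loki service name/port
  - name: Tempo
    type: tempo
    url: http://tempo:3100   # Tempo's HTTP port differs between versions (3100 or 3200)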

Alerting

Prometheus Alerting

PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
  - name: app
    interval: 30s
    rules:
    - alert: HighCPUUsage
      expr: |
        100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected"
        description: "CPU usage is above 80% for 5 minutes"

    - alert: PodCrashLooping
      expr: |
        rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod has restarted {{ $value }} times in 15 minutes"

    - alert: HighErrorRate
      expr: |
        (sum(rate(http_requests_total{status=~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))) * 100 > 5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value }}%"

Alertmanager

Configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m

    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty'
      - match:
          severity: warning
        receiver: 'slack'

    receivers:
    - name: 'default'
      email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'

    - name: 'slack'
      slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

    - name: 'pagerduty'
      pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'

Dashboarding

Grafana Dashboards

Import Dashboard:

# Port forward
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

# Default credentials
# Username: admin
# Password: prom-operator

Popular Dashboards:

  • Kubernetes Cluster Monitoring (315)
  • Node Exporter Full (1860)
  • Kubernetes Pods (6417)

Custom Dashboard (JSON):

{
  "dashboard": {
    "title": "My App Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ]
      }
    ]
  }
}
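
With kube-prometheus-stack, dashboards can also be provisioned declaratively: the Grafana sidecar watches ConfigMaps labeled grafana_dashboard (the chart's default label; it may be configured differently in your values). A sketch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"    # picked up by the Grafana dashboard sidecar
data:
  myapp-dashboard.json: |
    {
      "title": "My App Dashboard",
      "panels": []
    }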

Summary

Kubernetes observability is built on the three pillars, with alerting and visualization on top:

✅ Metrics: Prometheus, Metrics Server
✅ Logging: EFK / Loki Stack
✅ Tracing: Jaeger, Tempo
✅ Alerting: Alertmanager
✅ Visualization: Grafana