Monitoring, Logging, and Observability
Observability Fundamentals
Definition
Observability is the ability to understand a system's internal state from its external outputs.
3 Pillars of Observability:
1. Metrics → What is happening? (Numbers, trends)
2. Logs → Why is it happening? (Events, errors)
3. Traces → Where is it happening? (Request flow)
Monitoring vs Observability
| Monitoring | Observability |
|---|---|
| Predefined metrics | Unknown unknowns |
| "Is it working?" | "Why is it not working?" |
| Reactive | Proactive |
| Dashboards, alerts | Correlation, exploration |
Kubernetes Observability Challenges
- Ephemeral Nature: Pods come and go constantly
- Distributed Systems: microservices, many moving components
- Dynamic Scaling: resources change all the time
- Multi-tenancy: multiple teams and namespaces share the cluster
Golden Signals
Google SRE's four key metrics (see the PromQL sketch below):
- Latency: How long does a request take?
- Traffic: How many requests are coming in?
- Errors: How many requests are failing?
- Saturation: How full are the resources?
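As a rough sketch, the four signals can be expressed in PromQL. This assumes the application exposes the http_requests_total and http_request_duration_seconds metrics used later in this chapter, and that node_exporter is installed for the saturation example:
# Latency: 95th-percentile request duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Traffic: requests per second
sum(rate(http_requests_total[5m]))
# Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation: memory utilisation per node
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)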
Metrics
Definition
Metrics are numeric measurements collected over time (time-series data).
Metric Types:
- Counter: only ever increases (requests, errors)
- Gauge: can go up or down (CPU, memory, connections)
- Histogram: Distribution (request duration)
- Summary: Quantiles (p50, p95, p99)
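For intuition, this is roughly how the first three types look in Prometheus's text exposition format (metric names here are illustrative):
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
# TYPE process_open_fds gauge
process_open_fds 42
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 930
http_request_duration_seconds_bucket{le="+Inf"} 1027
http_request_duration_seconds_sum 53.4
http_request_duration_seconds_count 1027
A Summary similarly exposes quantile-labelled series (e.g. {quantile="0.95"}) plus _sum and _count.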
Kubernetes Metrics
1. Node Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- Filesystem usage
2. Pod Metrics
- CPU usage
- Memory usage
- Network I/O
- Container restarts
3. Container Metrics
- CPU usage
- Memory usage
- Disk usage
4. Cluster Metrics
- Total nodes
- Total pods
- API server requests
- etcd performance
Metrics Server
Definition: a lightweight aggregator of resource metrics that runs inside the Kubernetes cluster.
Installation:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Resource Metrics API:
# Node metrics
kubectl top nodes
# Pod metrics
kubectl top pods
# Specific namespace
kubectl top pods -n kube-system
# Container metrics
kubectl top pods --containers
Output:
NAME     CPU(cores)   MEMORY(bytes)
node-1   250m         1024Mi
node-2   180m         890Mi

NAME                   CPU(cores)   MEMORY(bytes)
nginx-deployment-abc   10m          50Mi
redis-master           5m           30Mi
Limitations:
- Only current values (no history)
- Limited metric set (CPU and memory only)
- Not for alerting
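Metrics Server serves the Resource Metrics API, which can also be queried directly; a quick sketch (piping through jq is optional and assumes it is installed):
# Raw Resource Metrics API (the same data kubectl top uses)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq .
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods | jq .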
Prometheus
Definition: an open-source monitoring and alerting toolkit.
Architecture:
Prometheus Server
↓ (scrape)
Target Endpoints (exporters)
↓
Time-Series Database
↓
Query (PromQL)
↓
Grafana / Alertmanager
Components:
- Prometheus Server: metrics scraping and storage
- Exporters: expose metrics from applications and systems
- Alertmanager: alert routing and management
- Pushgateway: for short-lived jobs
Prometheus Installation
Helm:
# Add repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
Verify:
kubectl get pods -n monitoring
kubectl get svc -n monitoring
ServiceMonitor
Definition: tells Prometheus which Services (and which ports/paths) to scrape.
YAML:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Application Service:
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
  selector:
    app: myapp
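Two common gotchas, stated here as assumptions about default chart values: with kube-prometheus-stack, Prometheus only picks up ServiceMonitors whose labels match its serviceMonitorSelector (by default the Helm release label, e.g. release: prometheus), and a ServiceMonitor only selects Services in its own namespace unless spec.namespaceSelector is set. A rough way to check which labels are expected:
# Inspect which ServiceMonitor labels the Prometheus instance selects
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.serviceMonitorSelector}'
# A common fix is adding that label to the ServiceMonitor, e.g. metadata.labels.release: prometheus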
PromQL (Prometheus Query Language)
Basic Queries:
Instant Vector:
# CPU usage
node_cpu_seconds_total
# Memory usage
node_memory_MemAvailable_bytes
# HTTP requests
http_requests_total
Filtering:
# By label
http_requests_total{job="api-server"}
# Multiple labels
http_requests_total{job="api-server", method="GET"}
# Regex
http_requests_total{status=~"5.."}
Aggregation:
# Sum
sum(http_requests_total)
# Average
avg(node_cpu_seconds_total)
# By label
sum(http_requests_total) by (job)
Rate:
# Requests per second
rate(http_requests_total[5m])
# Increase
increase(http_requests_total[1h])
Functions:
# Percentile
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Round
round(node_memory_MemAvailable_bytes / 1024 / 1024)
Complex Queries:
# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Request error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
Application Metrics
Instrumentation:
Go (Prometheus client):
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func handler(w http.ResponseWriter, r *http.Request) {
    // Observe request duration and count the request
    timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, r.URL.Path))
    defer timer.ObserveDuration()
    // Handle request
    httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/api", handler)
    http.ListenAndServe(":8080", nil)
}
Python (Prometheus client):
from prometheus_client import Counter, Histogram, start_http_server
import time

request_count = Counter('http_requests_total', 'Total requests',
                        ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds',
                             'Request duration',
                             ['method', 'endpoint'])

@request_duration.labels(method='GET', endpoint='/api').time()
def handle_request():
    # Process request
    time.sleep(0.1)
    request_count.labels(method='GET', endpoint='/api', status='200').inc()

if __name__ == '__main__':
    start_http_server(8080)
    # start_http_server runs in a daemon thread; keep the process alive
    while True:
        handle_request()
Custom Metrics
Annotation-based scraping (Service):
apiVersion: v1
kind: Service
metadata:
  name: myapp
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  ports:
    - name: metrics
      port: 8080
  selector:
    app: myapp
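These prometheus.io/* annotations are only honored if the Prometheus scrape configuration contains matching relabeling rules (the Prometheus Operator does not act on them by default). A minimal sketch of such a scrape config, along the lines of the classic prometheus Helm chart defaults:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
    - role: endpoints
  relabel_configs:
    # Only scrape services annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Use the annotated path instead of /metrics when present
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Rewrite the target address to the annotated port
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__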
Logging
Definition
Logging is the recording of system events.
Log Types:
- Application Logs: emitted by the application itself
- System Logs: OS, kernel
- Audit Logs: Security events
Kubernetes Logging
1. Container Logs
Stdout/Stderr:
The application inside the container writes to stdout/stderr:
# Simple logging
echo "Hello World"
# Structured logging (JSON)
echo '{"level":"info","message":"Hello World","timestamp":"2024-01-01T00:00:00Z"}'
View logs:
# Current logs
kubectl logs <pod-name>
# Previous container
kubectl logs <pod-name> --previous
# Specific container
kubectl logs <pod-name> -c <container-name>
# Follow logs
kubectl logs <pod-name> -f
# Since
kubectl logs <pod-name> --since=1h
# Tail
kubectl logs <pod-name> --tail=100
2. Node Logs
The container runtime on the node (Docker, containerd) writes container logs to files:
Docker:
/var/lib/docker/containers/<container-id>/<container-id>-json.log
Containerd:
/var/log/pods/<namespace>_<pod-name>_<pod-id>/<container-name>/<restart-count>.log
3. Cluster Component Logs
Control Plane:
- kube-apiserver
- kube-scheduler
- kube-controller-manager
- etcd
Worker Node:
- kubelet
- kube-proxy
Location:
- systemd units: journalctl -u kubelet
- Static pods: /var/log/pods/
Logging Patterns
1. Node-level Logging
Advantages:
- Simple
- No additional agents
Disadvantages:
- Logs are lost when the node goes away
- Retention limited
- No centralization
2. Sidecar Container
Streaming Sidecar:
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - name: logs
          mountPath: /var/log
    - name: log-shipper
      image: busybox
      command: ['sh', '-c', 'tail -f /var/log/app.log']
      volumeMounts:
        - name: logs
          mountPath: /var/log
  volumes:
    - name: logs
      emptyDir: {}
Advantages:
- Separation of concerns
- Multiple log streams
Disadvantages:
- Resource overhead
- Complexity
3. Node-level Logging Agent (DaemonSet)
Architecture:
Application → stdout/stderr
↓
Node log file
↓
Logging Agent (DaemonSet)
↓
Central Log Storage
Most common pattern!
EFK Stack (Elasticsearch, Fluentd, Kibana)
Components:
- Fluentd (or Fluent Bit): Log collection
- Elasticsearch: log storage and search
- Kibana: Visualization
Fluentd DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
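The serviceAccountName above assumes a ServiceAccount with read access to pod metadata exists so Fluentd can enrich log records; a minimal RBAC sketch:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: kube-system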
Elasticsearch
# Helm install
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch \
--namespace logging \
--create-namespace \
--set replicas=3
Kibana
helm install kibana elastic/kibana \
--namespace logging \
--set elasticsearchHosts=http://elasticsearch-master:9200
Access Kibana:
kubectl port-forward -n logging svc/kibana-kibana 5601:5601
# Open http://localhost:5601
Loki Stack (Grafana Loki)
Alternative to EFK: Lightweight, cost-effective.
Components:
- Promtail: Log collection
- Loki: Log storage
- Grafana: Visualization
Installation:
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace logging \
--create-namespace \
--set grafana.enabled=true \
--set prometheus.enabled=true \
--set promtail.enabled=true
Advantages over EFK:
- Cheaper (indexes only labels, not the full log text)
- Simpler
- Grafana integration
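Logs in Loki are queried with LogQL, whose selector syntax mirrors PromQL labels; a few illustrative queries (the label names here are assumptions):
# All error lines from one app in one namespace
{namespace="production", app="myapp"} |= "error"
# Parse JSON logs and filter on a field
{app="myapp"} | json | level="error"
# Error log rate over 5 minutes (usable in Grafana panels and alerts)
sum(rate({app="myapp"} |= "error" [5m]))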
Structured Logging
JSON Logging:
{
"timestamp": "2024-01-01T12:00:00Z",
"level": "info",
"message": "User logged in",
"user_id": "12345",
"ip": "192.168.1.1",
"trace_id": "abc123"
}
Advantages:
- Easy parsing
- Rich metadata
- Correlation (trace_id)
Libraries:
- Go: logrus, zap
- Python: structlog
- Java: logback
- Node.js: winston
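A minimal sketch with Go's zap, producing JSON output similar to the example above:
package main

import "go.uber.org/zap"

func main() {
    // NewProduction writes structured JSON log lines to stderr
    logger, err := zap.NewProduction()
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    logger.Info("User logged in",
        zap.String("user_id", "12345"),
        zap.String("trace_id", "abc123"),
    )
}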
Log Retention
Considerations:
- Storage Cost: log volumes grow very quickly
- Compliance: Regulatory requirements
- Performance: Index size
Strategy:
Hot tier (0-7 days): Fast SSD, full indexing
Warm tier (7-30 days): Slower disk, reduced replicas
Cold tier (30-90 days): Archive, searchable backup
Delete (>90 days): once the compliance period has expired
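With Elasticsearch, tiers like these are usually implemented as an ILM (Index Lifecycle Management) policy; a rough sketch, with thresholds that are purely illustrative:
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "allocate": { "number_of_replicas": 1 } }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}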
Tracing
Definition
Tracing is following a request's complete path through a distributed system.
Use Case: in a microservice architecture, a single request passes through many services:
Client → Gateway → Auth → UserService → Database
                              ↓
                        EmailService
Each hop adds latency. Where is the bottleneck?
OpenTelemetry
Definition: a vendor-neutral observability framework.
Components:
- SDK: Application instrumentation
- API: Standard interface
- Collector: Telemetry data collection
- Exporters: export telemetry to backends
Supported Backends:
- Jaeger
- Zipkin
- Tempo (Grafana)
- DataDog
- New Relic
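A minimal Collector configuration sketch: it receives OTLP from applications and forwards traces to an OTLP-capable backend such as Tempo (the endpoint address is an assumption):
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}

exporters:
  otlp:
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]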
Jaeger
Architecture:
Application (instrumented)
↓ (trace data)
Jaeger Agent
↓
Jaeger Collector
↓
Cassandra/Elasticsearch
↓
Jaeger Query
↓
Jaeger UI
Installation:
kubectl create namespace observability
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/crds/jaegertracing.io_jaegers_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/service_account.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role_binding.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/operator.yaml
Jaeger Instance:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
  namespace: observability
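Once the operator has reconciled the instance, the UI is typically exposed through a *-query Service on port 16686 (the exact Service name may vary):
kubectl get pods -n observability
kubectl port-forward -n observability svc/simplest-query 16686:16686
# Open http://localhost:16686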
Application Instrumentation
OpenTelemetry SDK:
Go:
import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() func() {
    exporter, _ := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
    ))
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("myapp"),
        )),
    )
    otel.SetTracerProvider(tp)
    return func() { tp.Shutdown(context.Background()) }
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    tracer := otel.Tracer("myapp")
    ctx, span := tracer.Start(r.Context(), "handleRequest")
    defer span.End()
    // Call other services with ctx
    result := callUserService(ctx)
    span.SetAttributes(attribute.String("user.id", result.UserID))
}
Trace Context Propagation
HTTP Headers:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
Automatic propagation:
- OpenTelemetry auto-instrumentation
- Service mesh (Istio, Linkerd)
Monitoring Stack
Complete Observability Stack
Recommended Stack:
┌─────────────────────────────────────────┐
│ Grafana (Visualization) │
├─────────────────────────────────────────┤
│ Prometheus │ Loki │ Tempo/Jaeger │
│ (Metrics) │ (Logs) │ (Traces) │
├─────────────────────────────────────────┤
│ Application (Instrumented) │
└─────────────────────────────────────────┘
Installation (Helm):
# Kube-prometheus-stack (Prometheus + Grafana)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Loki
helm install loki grafana/loki-stack \
--namespace monitoring \
--set grafana.enabled=false \
--set prometheus.enabled=false
# Tempo (Grafana Tempo)
helm install tempo grafana/tempo \
--namespace monitoring
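Grafana can then be pointed at all three backends as data sources; a provisioning sketch in which the service names, ports, and URLs are assumptions based on the release names above and may differ per chart version:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo.monitoring.svc:3200   # older chart versions use 3100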
Alerting
Prometheus Alerting
PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: app
      interval: 30s
      rules:
        - alert: HighCPUUsage
          expr: |
            100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage detected"
            description: "CPU usage is above 80% for 5 minutes"
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod has restarted {{ $value }} times in 15 minutes"
        - alert: HighErrorRate
          expr: |
            (sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value }}%"
Alertmanager
Configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty'
        - match:
            severity: warning
          receiver: 'slack'
    receivers:
      - name: 'default'
        email_configs:
          - to: 'team@example.com'
            from: 'alertmanager@example.com'
            smarthost: 'smtp.example.com:587'
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/XXX'
            channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: 'YOUR_SERVICE_KEY'
Dashboarding
Grafana Dashboards
Import Dashboard:
# Port forward
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Default credentials
# Username: admin
# Password: prom-operator
Popular Dashboards:
- Kubernetes Cluster Monitoring (315)
- Node Exporter Full (1860)
- Kubernetes Pods (6417)
Custom Dashboard (JSON):
{
  "dashboard": {
    "title": "My App Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ]
      }
    ]
  }
}
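With kube-prometheus-stack, dashboards can also be provisioned declaratively: the Grafana sidecar loads any ConfigMap carrying the grafana_dashboard label (the label name and value depend on chart values, so treat this as a sketch):
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  myapp-dashboard.json: |
    { "title": "My App Dashboard", "panels": [] }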
Summary
Kubernetes observability rests on the three pillars, backed by alerting and visualization:
✅ Metrics: Prometheus, Metrics Server
✅ Logging: EFK / Loki stack
✅ Tracing: Jaeger, Tempo
✅ Alerting: Alertmanager
✅ Visualization: Grafana