# Software Architecture: Observability

Observability is the ability to understand a system's internal state from its external outputs. The three pillars are metrics, logs, and traces.

## The Three Pillars
| Pillar | What it measures | Tools |
|---|---|---|
| Metrics | Numeric data over time | Prometheus, Grafana |
| Logs | Events and messages | ELK Stack, Loki |
| Traces | Request flow across services | Jaeger, Zipkin |
## Prometheus

A metrics-based monitoring and alerting system.
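Prometheus pulls metrics by scraping an HTTP `/metrics` endpoint that serves plain text. As a sketch of what a scrape returns, here is the text exposition format rendered by hand (the metric name, labels, and values are illustrative):

```typescript
// Render one counter in the Prometheus text exposition format,
// i.e. what a /metrics endpoint serves on each scrape.
type Labels = Record<string, string>;

function renderCounter(name: string, help: string, series: Array<[Labels, number]>): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const [labels, value] of series) {
    const labelStr = Object.keys(labels)
      .map((k) => `${k}="${labels[k]}"`)
      .join(",");
    lines.push(`${name}{${labelStr}} ${value}`);
  }
  return lines.join("\n") + "\n";
}

console.log(
  renderCounter("http_requests_total", "Total HTTP requests.", [
    [{ method: "GET", status: "200" }, 42],
    [{ method: "POST", status: "500" }, 3],
  ])
);
```

In practice a client library such as prom-client produces this output; the point is that the scrape protocol is just text over HTTP.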
### Basic Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

### PromQL Queries
```promql
# Requests per second
rate(http_requests_total[5m])

# p99 latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Available memory (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
```
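The `histogram_quantile` query above estimates a quantile from cumulative bucket counters. A small sketch of that interpolation, using made-up latency buckets (`le` is each bucket's upper bound in seconds):

```typescript
// Approximate PromQL's histogram_quantile(): buckets hold cumulative
// counts of observations with value <= the upper bound `le`.
interface Bucket {
  le: number;
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  const rank = q * total; // target observation rank
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Linear interpolation within the bucket, as Prometheus does.
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / (b.count - prevCount));
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return sorted[sorted.length - 1].le;
}

// Hypothetical latency buckets (seconds): 100 observations total.
const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
];
console.log(histogramQuantile(0.99, latencyBuckets)); // p99 estimate
```

This is also why quantile accuracy depends on bucket layout: the result is interpolated between bucket bounds, not computed from raw observations.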
### Alerting Rules

```yaml
# alerts/app.yml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate (> 5%)"
          description: "{{ $labels.instance }} has {{ $value | humanizePercentage }} errors"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 1s"
```

## Grafana
A visualization and dashboarding platform.

### Dashboard JSON
```json
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (method)",
            "legendFormat": "{{method}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      }
    ]
  }
}
```

### Provisioning
```yaml
# provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

## ELK Stack
Elasticsearch, Logstash, and Kibana for log management.

### Logstash Pipeline
```conf
# logstash.conf
input {
  beats {
    port => 5044
  }
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["app-logs"]
    codec => json
  }
}

filter {
  if [type] == "app" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    if [level] == "ERROR" {
      mutate {
        add_tag => ["error"]
      }
    }
  }
  geoip {
    source => "client_ip"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```
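The `grok` filter above is essentially a named regular expression over each line. A rough TypeScript analogue of that pattern, including the pipeline's conditional tagging of `ERROR` lines (the sample log line is made up):

```typescript
// Rough analogue of the grok pattern:
// %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}
const LOG_LINE =
  /^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?) (TRACE|DEBUG|INFO|WARN|ERROR|FATAL) (.*)$/;

function parseLogLine(line: string) {
  const m = LOG_LINE.exec(line);
  if (!m) return null; // in Logstash this would become a _grokparsefailure tag
  const [, timestamp, level, message] = m;
  // Mirror the pipeline's conditional tagging of ERROR lines.
  const tags = level === "ERROR" ? ["error"] : [];
  return { timestamp, level, message, tags };
}

console.log(parseLogLine("2024-05-01T12:00:00Z ERROR payment failed"));
```

Grok's value over raw regexes is its library of reusable named patterns; the structure extracted is the same.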
### Filebeat

```yaml
# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "logs-app-%{+yyyy.MM.dd}"
      when.contains:
        kubernetes.namespace: "production"
```

## Loki (ELK Alternative)
Grafana Labs' log aggregation system, lighter-weight than ELK.

### Configuration
```yaml
# loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2020-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
  filesystem:
    directory: /loki/chunks
```

### Promtail (Collection Agent)
```yaml
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - docker: {}
      - json:
          expressions:
            level: level
            msg: msg
      - labels:
          level:
```

### LogQL Queries
```logql
# Error logs
{namespace="production"} |= "error"

# JSON parsing
{app="myapp"} | json | level="error"

# Per-second error rate over a 1m window, by pod
sum(rate({app="myapp"} |= "error" [1m])) by (pod)

# High-latency entries
{app="myapp"} | json | latency > 1000
```
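A LogQL query is a pipeline of line filters and parser stages. A sketch of `|= "error" | json | level="error"` applied to in-memory lines (the sample lines are made up):

```typescript
// Sketch of the LogQL pipeline {app="myapp"} |= "error" | json | level="error",
// applied to raw log lines.
function logqlErrors(lines: string[]): Array<Record<string, unknown>> {
  return lines
    .filter((l) => l.includes("error")) // |= "error" line filter
    .map((l) => {
      try {
        return JSON.parse(l) as Record<string, unknown>; // | json parser stage
      } catch (e) {
        return null; // unparseable lines are dropped here for simplicity
      }
    })
    .filter((o): o is Record<string, unknown> => o !== null && o["level"] === "error"); // label filter
}

const entries = logqlErrors([
  '{"level":"info","msg":"ok"}',
  '{"level":"error","msg":"db timeout"}',
]);
console.log(entries.length); // → 1
```

The cheap substring filter runs first and the parser only on surviving lines, which is also why `|=` before `| json` is the efficient ordering in real LogQL.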
## Distributed Tracing

### OpenTelemetry

The standard SDK for instrumentation.
```typescript
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})

sdk.start()
```
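The SDK above exports spans, but what stitches spans from different services into one trace is context propagation: the auto-instrumentations forward a W3C Trace Context `traceparent` header on outgoing HTTP calls. A dependency-free sketch of generating and parsing that header:

```typescript
import { randomBytes } from "crypto";

// W3C Trace Context header: version-traceid-spanid-flags.
function makeTraceparent(): string {
  const traceId = randomBytes(16).toString("hex"); // 32 hex chars
  const spanId = randomBytes(8).toString("hex"); // 16 hex chars
  return `00-${traceId}-${spanId}-01`; // flag 01 = sampled
}

function parseTraceparent(header: string) {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    traceId: m[1],
    parentSpanId: m[2],
    sampled: (parseInt(m[3], 16) & 1) === 1,
  };
}

const ctx = parseTraceparent(makeTraceparent());
console.log(ctx); // { traceId, parentSpanId, sampled: true }
```

A downstream service keeps the incoming `traceId` and uses the incoming span id as the parent of its own spans, so Jaeger can reassemble the full request tree.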
### Jaeger

```yaml
# docker-compose.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "4318:4318"   # OTLP HTTP
      - "4317:4317"   # OTLP gRPC
    environment:
      - COLLECTOR_OTLP_ENABLED=true
```

## Full Stack (Kubernetes)
```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 50Gi

grafana:
  adminPassword: ${GRAFANA_ADMIN_PASSWORD}
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          folder: ''
          type: file
          options:
            path: /var/lib/grafana/dashboards

alertmanager:
  config:
    receivers:
      - name: 'slack'
        slack_configs:
          - channel: '#alerts'
            api_url: ${SLACK_WEBHOOK_URL}
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
```

## APM (Application Performance Monitoring)
### Dynatrace

```yaml
# Kubernetes OneAgent
apiVersion: dynatrace.com/v1beta1
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: https://xxx.live.dynatrace.com/api
  tokens: dynakube-tokens
  oneAgent:
    cloudNativeFullStack:
      tolerations:
        - effect: NoSchedule
          operator: Exists
```

### New Relic
```typescript
// newrelic.ts
import newrelic from 'newrelic'

// Automatic instrumentation
// Configured via NEW_RELIC_LICENSE_KEY and NEW_RELIC_APP_NAME

// Custom metrics
newrelic.recordMetric('Custom/MyMetric', 100)

// Custom events
newrelic.recordCustomEvent('UserSignup', {
  userId: '123',
  plan: 'premium',
})
```

## SonarQube (Code Quality)
```properties
# sonar-project.properties
sonar.projectKey=myapp
sonar.projectName=My Application
sonar.sources=src
sonar.tests=tests
sonar.javascript.lcov.reportPaths=coverage/lcov.info
sonar.coverage.exclusions=**/*.test.ts,**/*.spec.ts
```

```yaml
# GitHub Actions
- name: SonarQube Scan
  uses: sonarsource/sonarqube-scan-action@master
  env:
    SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
    SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
```