Prometheus Configuration

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

The Prometheus configuration file is usually located at /etc/prometheus/prometheus.yml. This file belongs to Prometheus alone and is independent of the Alertmanager and Grafana configuration.

Besides the file, the binary accepts command-line flags, which prometheus -h lists:

prometheus -h
usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help                     Show context-sensitive help (also try --help-long and --help-man).
      --version                  Show application version.
      --config.file="prometheus.yml"
                                 Prometheus configuration file path.
      --web.listen-address="0.0.0.0:9090"
                                 Address to listen on for UI, API, and telemetry.
      --web.config.file=""       [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
      --web.read-timeout=5m      Maximum duration before timing out read of the request, and closing idle connections.
      --web.max-connections=512  Maximum number of simultaneous connections.
      --web.external-url=<URL>   The URL under which Prometheus is externally reachable (for example, if Prometheus is served via a reverse proxy). Used for generating relative and absolute links back to Prometheus
                                 itself. If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Prometheus. If omitted, relevant URL components will be derived automatically.
      --web.route-prefix=<path>  Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
      --web.user-assets=<path>   Path to static asset directory, available at /user.
      --web.enable-lifecycle     Enable shutdown and reload via HTTP request.
      --web.enable-admin-api     Enable API endpoints for admin control actions.
      --web.enable-remote-write-receiver
                                 Enable API endpoint accepting remote write requests.
      --web.console.templates="consoles"
                                 Path to the console template directory, available at /consoles.
      --web.console.libraries="console_libraries"
                                 Path to the console library directory.
      --web.page-title="Prometheus Time Series Collection and Processing Server"
                                 Document title of Prometheus instance.
      --web.cors.origin=".*"     Regex for CORS origin. It is fully anchored. Example: 'https?://(domain1|domain2)\.com'
      --storage.tsdb.path="data/"
                                 Base path for metrics storage. Use with server mode only.
      --storage.tsdb.retention=STORAGE.TSDB.RETENTION
                                 [DEPRECATED] How long to retain samples in storage. This flag has been deprecated, use "storage.tsdb.retention.time" instead. Use with server mode only.
      --storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME
                                 How long to retain samples in storage. When this flag is set it overrides "storage.tsdb.retention". If neither this flag nor "storage.tsdb.retention" nor
                                 "storage.tsdb.retention.size" is set, the retention time defaults to 15d. Units Supported: y, w, d, h, m, s, ms. Use with server mode only.
      --storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE
                                 Maximum number of bytes that can be stored for blocks. A unit is required, supported units: B, KB, MB, GB, TB, PB, EB. Ex: "512MB". Based on powers-of-2, so 1KB is 1024B. Use with
                                 server mode only.
      --storage.tsdb.no-lockfile
                                 Do not create lockfile in data directory. Use with server mode only.
      --storage.tsdb.head-chunks-write-queue-size=0
                                 Size of the queue through which head chunks are written to the disk to be m-mapped, 0 disables the queue completely. Experimental. Use with server mode only.
      --storage.agent.path="data-agent/"
                                 Base path for metrics storage. Use with agent mode only.
      --storage.agent.wal-compression
                                 Compress the agent WAL. Use with agent mode only.
      --storage.agent.retention.min-time=STORAGE.AGENT.RETENTION.MIN-TIME
                                 Minimum age samples may be before being considered for deletion when the WAL is truncated Use with agent mode only.
      --storage.agent.retention.max-time=STORAGE.AGENT.RETENTION.MAX-TIME
                                 Maximum age samples may be before being forcibly deleted when the WAL is truncated Use with agent mode only.
      --storage.agent.no-lockfile
                                 Do not create lockfile in data directory. Use with agent mode only.
      --storage.remote.flush-deadline=<duration>
                                 How long to wait flushing sample on shutdown or config reload.
      --storage.remote.read-sample-limit=5e7
                                 Maximum overall number of samples to return via the remote read interface, in a single query. 0 means no limit. This limit is ignored for streamed response types. Use with server
                                 mode only.
      --storage.remote.read-concurrent-limit=10
                                 Maximum number of concurrent remote read calls. 0 means no limit. Use with server mode only.
      --storage.remote.read-max-bytes-in-frame=1048576
                                 Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note that client might have limit on frame size as well. 1MB as recommended by
                                 protobuf by default. Use with server mode only.
      --rules.alert.for-outage-tolerance=1h
                                 Max time to tolerate prometheus outage for restoring "for" state of alert. Use with server mode only.
      --rules.alert.for-grace-period=10m
                                 Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured "for" time greater than grace period. Use with server mode only.
      --rules.alert.resend-delay=1m
                                 Minimum amount of time to wait before resending an alert to Alertmanager. Use with server mode only.
      --alertmanager.notification-queue-capacity=10000
                                 The capacity of the queue for pending Alertmanager notifications. Use with server mode only.
      --query.lookback-delta=5m  The maximum lookback duration for retrieving metrics during expression evaluations and federation. Use with server mode only.
      --query.timeout=2m         Maximum time a query may take before being aborted. Use with server mode only.
      --query.max-concurrency=20
                                 Maximum number of queries executed concurrently. Use with server mode only.
      --query.max-samples=50000000
                                 Maximum number of samples a single query can load into memory. Note that queries will fail if they try to load more samples than this into memory, so this also limits the number of
                                 samples a query can return. Use with server mode only.
      --enable-feature= ...      Comma separated feature names to enable. Valid options: agent, exemplar-storage, expand-external-labels, memory-snapshot-on-shutdown, promql-at-modifier, promql-negative-offset,
                                 promql-per-step-stats, remote-write-receiver (DEPRECATED), extra-scrape-metrics, new-service-discovery-manager, auto-gomaxprocs, no-default-scrape-port, native-histograms. See
                                 https://prometheus.io/docs/prometheus/latest/feature_flags/ for more details.
      --log.level=info           Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --log.format=logfmt        Output format of log messages. One of: [logfmt, json]

Some of these settings can also be defined in the Prometheus configuration file, which is simpler.

When deploying with the kube-prometheus-stack Helm chart, the chart's values.yaml file configures all of the services at once.

By default, collected samples are retained for 15 days; this can be changed by passing flags when the service starts.

Some Prometheus settings are set through flags, as shown above, while others come from the configuration file.

Settings passed as flags are fixed for the lifetime of the process and are not re-read at runtime, so changing one requires restarting the service.

Flags

Flags are the arguments passed to the binary at startup to change Prometheus settings that are not part of /etc/prometheus/prometheus.yml.

Some storage flags:

--storage.tsdb.path: Where Prometheus writes its database. Defaults to data/.
--storage.tsdb.retention.time: When to remove old data. Defaults to 15d. Overrides storage.tsdb.retention if this flag is set to anything other than the default.
--storage.tsdb.retention.size: The maximum number of bytes of storage blocks to retain. The oldest data is removed first. Defaults to 0, or disabled. Supported units: B, KB, MB, GB, TB, PB, EB. Example: "512MB". Based on powers of 2, so 1KB is 1024B. Only persistent blocks are deleted to honor this retention, although the WAL and memory-mapped chunks are counted in the total size. The minimum disk requirement is therefore the peak space taken by the wal directory (WAL and checkpoint) and chunks_head (memory-mapped head chunks) combined, which peaks every 2 hours.
--storage.tsdb.wal-compression: Enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to be halved with little extra CPU load. This flag is enabled by default.
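
As a sketch of how these flags combine, the following Docker Compose service sets both a time limit and a size limit. Docker Compose, the prom/prometheus image, the mount paths and the 50GB value are assumptions made for illustration and are not part of the setup described here. When both limits are set, whichever is reached first triggers deletion.

```yaml
# docker-compose.yml (hypothetical deployment, for illustration only)
services:
  prometheus:
    image: prom/prometheus
    # Overriding "command" replaces the image's default flags,
    # so the config file and data path are restated here.
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d   # keep at most 30 days of data
      - --storage.tsdb.retention.size=50GB  # assumed value: also cap block storage
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus

volumes:
  prometheus-data:
```

Because these are flags, changing either limit means recreating the container; a configuration reload is not enough.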

This is how the service start-up would look in a systemd unit (comments go on their own lines, since systemd does not allow them at the end of a line):

ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
# Database path changed
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.external-url=http://34.89.26.156:9090 \
# Retention changed to 30 days
    --storage.tsdb.retention.time=30d

Run manually, with only the TSDB settings changed:

prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --storage.tsdb.retention.time=30d
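
In a Kubernetes deployment managed through the kube-prometheus-stack chart, the same retention change would normally be made in values.yaml rather than on a command line. A minimal sketch, assuming the chart exposes the Prometheus Operator fields retention and retentionSize under prometheus.prometheusSpec; the key names should be checked against the values.yaml of the chart version in use, and the 50GB figure is an illustrative assumption:

```yaml
# values.yaml for the kube-prometheus-stack chart (sketch; verify the keys
# against the values.yaml shipped with your chart version)
prometheus:
  prometheusSpec:
    retention: 30d         # counterpart of --storage.tsdb.retention.time
    retentionSize: "50GB"  # assumed value; counterpart of --storage.tsdb.retention.size
```

The operator turns these fields into the corresponding flags on the Prometheus container, so the pod is restarted when they change.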

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

  # File to which PromQL queries are logged.
  # Reloading the configuration will reopen the file.
  [ query_log_file: <string> ]

# Paths (globs) of the rule files to load. Relative paths are resolved from the directory of the configuration file, here /etc/prometheus/.
rule_files:
  [ - <filepath_glob> ... ]

# Specifies a set of targets and parameters describing how to scrape them.
# In the general case, one scrape configuration specifies a single job.
# Targets may be configured statically via static_configs or discovered dynamically using one of the supported service discovery mechanisms.

scrape_configs:
  [ - <scrape_config> ... ]

# Settings related to Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]

# Storage settings that are reloadable at runtime.
storage:
  [ tsdb: <tsdb> ]
  [ exemplars: <exemplars> ]

# Configures exporting traces.
tracing:
  [ [<tracing_config>] ]
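
To make the schema above concrete, here is a minimal sketch of a complete prometheus.yml that uses only the top-level sections listed. The label value, rule path and target addresses are hypothetical placeholders, and the 15s intervals are an assumption chosen to show an override of the 1m defaults.

```yaml
# prometheus.yml (minimal sketch; names, targets and paths are hypothetical)
global:
  scrape_interval: 15s      # overrides the 1m default
  evaluation_interval: 15s  # overrides the 1m default
  external_labels:
    cluster: demo           # attached when talking to external systems

rule_files:
  - rules/*.yml             # relative to the directory of this file

scrape_configs:
  - job_name: prometheus    # Prometheus scraping its own /metrics endpoint
    static_configs:
      - targets: ["localhost:9090"]

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]  # assumed Alertmanager address
```

A file like this can be validated before use with promtool check config prometheus.yml, which ships alongside the prometheus binary.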

scrape_config

The scrape_configs section holds a list of jobs, and each job can be configured independently: each can have its own scrape interval, authentication method, service discovery mechanism, TLS settings, and so on.
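
As a small illustration of per-job settings, the sketch below defines two jobs that differ in interval, scheme and authentication. Job names, hostnames and file paths are hypothetical.

```yaml
scrape_configs:
  # Hypothetical job: fast-changing application metrics over plain HTTP.
  - job_name: api
    scrape_interval: 10s
    scrape_timeout: 5s      # must not exceed scrape_interval
    static_configs:
      - targets: ["api.example.internal:8080"]

  # Hypothetical job: slow exporter behind HTTPS with basic auth.
  - job_name: batch-exporter
    scrape_interval: 2m
    scheme: https
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/batch-exporter.password
    tls_config:
      ca_file: /etc/prometheus/certs/ca.crt
    static_configs:
      - targets: ["batch.example.internal:9100"]
```

Any setting a job omits, such as scrape_timeout in the second job, falls back to the value in the global section.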

The focus is usually on Kubernetes, but the example below illustrates the concepts and gives an idea of what Prometheus is capable of.

scrape_configs:

  - job_name: prometheus
    honor_labels: true
    # scrape_interval is taken from the configured global value (15s).
    # scrape_timeout is taken from the global default (10s).

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    file_sd_configs:
      - files:
          - foo/*.slow.json
          - foo/*.slow.yml
          - single/file.yml
        refresh_interval: 10m
      - files:
          - bar/*.yaml
    static_configs:
      - targets: ["localhost:9090", "localhost:9191"]
        labels:
          my: label
          your: label
    relabel_configs:
      - source_labels: [job, __meta_dns_name]
        regex: (.*)some-[regex]
        target_label: job
        replacement: foo-${1}
        # action defaults to 'replace'
      - source_labels: [abc]
        target_label: cde
      - replacement: static
        target_label: abc
      - regex:
        replacement: static
        target_label: abc
      - source_labels: [foo]
        target_label: abc
        action: keepequal
      - source_labels: [foo]
        target_label: abc
        action: dropequal
    authorization:
      credentials_file: valid_token_file
    tls_config:
      min_version: TLS10

  - job_name: service-x
    basic_auth:
      username: admin_name
      password: "multiline\nmysecret\ntest"
    scrape_interval: 50s
    scrape_timeout: 5s
    body_size_limit: 10MB
    sample_limit: 1000
    metrics_path: /my_path
    scheme: https
    dns_sd_configs:
      - refresh_interval: 15s
        names:
          - first.dns.address.domain.com
          - second.dns.address.domain.com
      - names:
          - first.dns.address.domain.com
    relabel_configs:
      - source_labels: [job]
        regex: (.*)some-[regex]
        action: drop
      - source_labels: [__address__]
        modulus: 8
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: 1
        action: keep
      - action: labelmap
        regex: 1
      - action: labeldrop
        regex: d
      - action: labelkeep
        regex: k
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: expensive_metric.*
        action: drop

  - job_name: service-y
    consul_sd_configs:
      - server: "localhost:1234"
        token: mysecret
        services: ["nginx", "cache", "mysql"]
        tags: ["canary", "v1"]
        node_meta:
          rack: "123"
        allow_stale: true
        scheme: https
        tls_config:
          ca_file: valid_ca_file
          cert_file: valid_cert_file
          key_file: valid_key_file
          insecure_skip_verify: false
    relabel_configs:
      - source_labels: [__meta_sd_consul_tags]
        separator: ","
        regex: label:([^=]+)=([^,]+)
        target_label: ${1}
        replacement: ${2}

  - job_name: service-z
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file
    authorization:
      credentials: mysecret

  - job_name: service-kubernetes
    kubernetes_sd_configs:
      - role: endpoints
        api_server: "https://localhost:1234"
        tls_config:
          cert_file: valid_cert_file
          key_file: valid_key_file
        basic_auth:
          username: "myusername"
          password: "mysecret"

  - job_name: service-kubernetes-namespaces
    kubernetes_sd_configs:
      - role: endpoints
        api_server: "https://localhost:1234"
        namespaces:
          names:
            - default
    basic_auth:
      username: "myusername"
      password_file: valid_password_file

  - job_name: service-kuma
    kuma_sd_configs:
      - server: http://kuma-control-plane.kuma-system.svc:5676

  - job_name: service-marathon
    marathon_sd_configs:
      - servers:
          - "https://marathon.example.com:443"
        auth_token: "mysecret"
        tls_config:
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: service-nomad
    nomad_sd_configs:
      - server: 'http://localhost:4646'

  - job_name: service-ec2
    ec2_sd_configs:
      - region: us-east-1
        access_key: access
        secret_key: mysecret
        profile: profile
        filters:
          - name: tag:environment
            values:
              - prod
          - name: tag:service
            values:
              - web
              - db

  - job_name: service-lightsail
    lightsail_sd_configs:
      - region: us-east-1
        access_key: access
        secret_key: mysecret
        profile: profile

  - job_name: service-azure
    azure_sd_configs:
      - environment: AzurePublicCloud
        authentication_method: OAuth
        subscription_id: 11AAAA11-A11A-111A-A111-1111A1111A11
        resource_group: my-resource-group
        tenant_id: BBBB222B-B2B2-2B22-B222-2BB2222BB2B2
        client_id: 333333CC-3C33-3333-CCC3-33C3CCCCC33C
        client_secret: mysecret
        port: 9100

  - job_name: service-nerve
    nerve_sd_configs:
      - servers:
          - localhost
        paths:
          - /monitoring

  - job_name: 0123service-xxx
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:9090

  - job_name: badfederation
    honor_timestamps: false
    metrics_path: /federate
    static_configs:
      - targets:
          - localhost:9090

  - job_name: 測試
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:9090

  - job_name: httpsd
    http_sd_configs:
      - url: "http://example.com/prometheus"

  - job_name: service-triton
    triton_sd_configs:
      - account: "testAccount"
        dns_suffix: "triton.example.com"
        endpoint: "triton.example.com"
        port: 9163
        refresh_interval: 1m
        version: 1
        tls_config:
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: digitalocean-droplets
    digitalocean_sd_configs:
      - authorization:
          credentials: abcdef

  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock

  - job_name: dockerswarm
    dockerswarm_sd_configs:
      - host: http://127.0.0.1:2375
        role: nodes

  - job_name: service-openstack
    openstack_sd_configs:
      - role: instance
        region: RegionOne
        port: 80
        refresh_interval: 1m
        tls_config:
          ca_file: valid_ca_file
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: service-puppetdb
    puppetdb_sd_configs:
      - url: https://puppetserver/
        query: 'resources { type = "Package" and title = "httpd" }'
        include_parameters: true
        port: 80
        refresh_interval: 1m
        tls_config:
          ca_file: valid_ca_file
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: hetzner
    relabel_configs:
      - action: uppercase
        source_labels: [instance]
        target_label: instance
    hetzner_sd_configs:
      - role: hcloud
        authorization:
          credentials: abcdef
      - role: robot
        basic_auth:
          username: abcdef
          password: abcdef

  - job_name: service-eureka
    eureka_sd_configs:
      - server: "http://eureka.example.com:8761/eureka"

  - job_name: ovhcloud
    ovhcloud_sd_configs:
      - service: vps
        endpoint: ovh-eu
        application_key: testAppKey
        application_secret: testAppSecret
        consumer_key: testConsumerKey
        refresh_interval: 1m
      - service: dedicated_server
        endpoint: ovh-eu
        application_key: testAppKey
        application_secret: testAppSecret
        consumer_key: testConsumerKey
        refresh_interval: 1m

  - job_name: scaleway
    scaleway_sd_configs:
      - role: instance
        project_id: 11111111-1111-1111-1111-111111111112
        access_key: SCWXXXXXXXXXXXXXXXXX
        secret_key: 11111111-1111-1111-1111-111111111111
      - role: baremetal
        project_id: 11111111-1111-1111-1111-111111111112
        access_key: SCWXXXXXXXXXXXXXXXXX
        secret_key: 11111111-1111-1111-1111-111111111111

  - job_name: linode-instances
    linode_sd_configs:
      - authorization:
          credentials: abcdef

  - job_name: uyuni
    uyuni_sd_configs:
      - server: https://localhost:1234
        username: gopher
        password: hole

  - job_name: ionos
    ionos_sd_configs:
      - datacenter_id: 8feda53f-15f0-447f-badf-ebe32dad2fc0
        authorization:
          credentials: abcdef

  - job_name: vultr
    vultr_sd_configs:
      - authorization:
          credentials: abcdef
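
Since Kubernetes is the usual focus, here is a sketch of how service discovery and relabeling are commonly combined there. It assumes Prometheus runs inside the cluster (so no api_server is given) and that pods opt in through a prometheus.io/scrape annotation, which is a widespread convention rather than something Prometheus enforces. The job name is hypothetical.

```yaml
scrape_configs:
  - job_name: kubernetes-pods   # hypothetical job name
    kubernetes_sd_configs:
      - role: pod               # discover every pod visible to Prometheus
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: "true"
        action: keep
      # Copy namespace and pod name from discovery metadata into regular labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Labels starting with __meta_ are dropped after relabeling, which is why the last two rules copy them into namespace and pod.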

tracing

tracing:
  endpoint: "localhost:4317"
  client_type: "grpc"
  headers:
    foo: "bar"
  timeout: 5s
  compression: "gzip"
  tls_config:
    cert_file: valid_cert_file
    key_file: valid_key_file
    insecure_skip_verify: true

storage

storage:
  tsdb:
    path:
    retention:
      time:
      size:
    wal-compression:
    out_of_order_time_window: 30m
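
Which fields the storage section accepts depends on the Prometheus version. A minimal sketch restricted to two fields from the official configuration reference, out_of_order_time_window and exemplars.max_exemplars, with illustrative values:

```yaml
storage:
  tsdb:
    # Accept out-of-order samples up to 30 minutes older than the
    # newest data already in the TSDB.
    out_of_order_time_window: 30m
  exemplars:
    # Size of the in-memory exemplar buffer; only used when the
    # exemplar-storage feature flag is enabled.
    max_exemplars: 100000
```

Settings such as the data path and WAL compression are covered by the --storage.tsdb.* flags described earlier.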


remote_write:
  - url: http://remote1/push
    name: drop_expensive
    write_relabel_configs:
      - source_labels: [__name__]
        regex: expensive.*
        action: drop
    oauth2:
      client_id: "123"
      client_secret: "456"
      token_url: "http://remote1/auth"
      tls_config:
        cert_file: valid_cert_file
        key_file: valid_key_file

  - url: http://remote2/push
    name: rw_tls
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file
    headers:
      name: value

remote_read:
  - url: http://remote1/read
    read_recent: true
    name: default
    enable_http2: false
  - url: http://remote3/read
    read_recent: false
    name: read_special
    required_matchers:
      job: special
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file
