
Configuration

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

The Prometheus configuration file lives at /etc/prometheus/prometheus.yml. This file configures Prometheus only; it has nothing to do with Alertmanager or Grafana.

prometheus -h  
usage: prometheus [<flags>]

The Prometheus monitoring server

Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--version Show application version.
--config.file="prometheus.yml"
Prometheus configuration file path.
--web.listen-address="0.0.0.0:9090"
Address to listen on for UI, API, and telemetry.
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
--web.read-timeout=5m Maximum duration before timing out read of the request, and closing idle connections.
--web.max-connections=512 Maximum number of simultaneous connections.
--web.external-url=<URL> The URL under which Prometheus is externally reachable (for example, if Prometheus is served via a reverse proxy). Used for generating relative and absolute links back to Prometheus
itself. If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Prometheus. If omitted, relevant URL components will be derived automatically.
--web.route-prefix=<path> Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
--web.user-assets=<path> Path to static asset directory, available at /user.
--web.enable-lifecycle Enable shutdown and reload via HTTP request.
--web.enable-admin-api Enable API endpoints for admin control actions.
--web.enable-remote-write-receiver
Enable API endpoint accepting remote write requests.
--web.console.templates="consoles"
Path to the console template directory, available at /consoles.
--web.console.libraries="console_libraries"
Path to the console library directory.
--web.page-title="Prometheus Time Series Collection and Processing Server"
Document title of Prometheus instance.
--web.cors.origin=".*" Regex for CORS origin. It is fully anchored. Example: 'https?://(domain1|domain2)\.com'
--storage.tsdb.path="data/"
Base path for metrics storage. Use with server mode only.
--storage.tsdb.retention=STORAGE.TSDB.RETENTION
[DEPRECATED] How long to retain samples in storage. This flag has been deprecated, use "storage.tsdb.retention.time" instead. Use with server mode only.
--storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME
How long to retain samples in storage. When this flag is set it overrides "storage.tsdb.retention". If neither this flag nor "storage.tsdb.retention" nor
"storage.tsdb.retention.size" is set, the retention time defaults to 15d. Units Supported: y, w, d, h, m, s, ms. Use with server mode only.
--storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE
Maximum number of bytes that can be stored for blocks. A unit is required, supported units: B, KB, MB, GB, TB, PB, EB. Ex: "512MB". Based on powers-of-2, so 1KB is 1024B. Use with
server mode only.
--storage.tsdb.no-lockfile
Do not create lockfile in data directory. Use with server mode only.
--storage.tsdb.head-chunks-write-queue-size=0
Size of the queue through which head chunks are written to the disk to be m-mapped, 0 disables the queue completely. Experimental. Use with server mode only.
--storage.agent.path="data-agent/"
Base path for metrics storage. Use with agent mode only.
--storage.agent.wal-compression
Compress the agent WAL. Use with agent mode only.
--storage.agent.retention.min-time=STORAGE.AGENT.RETENTION.MIN-TIME
Minimum age samples may be before being considered for deletion when the WAL is truncated Use with agent mode only.
--storage.agent.retention.max-time=STORAGE.AGENT.RETENTION.MAX-TIME
Maximum age samples may be before being forcibly deleted when the WAL is truncated Use with agent mode only.
--storage.agent.no-lockfile
Do not create lockfile in data directory. Use with agent mode only.
--storage.remote.flush-deadline=<duration>
How long to wait flushing sample on shutdown or config reload.
--storage.remote.read-sample-limit=5e7
Maximum overall number of samples to return via the remote read interface, in a single query. 0 means no limit. This limit is ignored for streamed response types. Use with server
mode only.
--storage.remote.read-concurrent-limit=10
Maximum number of concurrent remote read calls. 0 means no limit. Use with server mode only.
--storage.remote.read-max-bytes-in-frame=1048576
Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note that client might have limit on frame size as well. 1MB as recommended by
protobuf by default. Use with server mode only.
--rules.alert.for-outage-tolerance=1h
Max time to tolerate prometheus outage for restoring "for" state of alert. Use with server mode only.
--rules.alert.for-grace-period=10m
Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured "for" time greater than grace period. Use with server mode only.
--rules.alert.resend-delay=1m
Minimum amount of time to wait before resending an alert to Alertmanager. Use with server mode only.
--alertmanager.notification-queue-capacity=10000
The capacity of the queue for pending Alertmanager notifications. Use with server mode only.
--query.lookback-delta=5m The maximum lookback duration for retrieving metrics during expression evaluations and federation. Use with server mode only.
--query.timeout=2m Maximum time a query may take before being aborted. Use with server mode only.
--query.max-concurrency=20
Maximum number of queries executed concurrently. Use with server mode only.
--query.max-samples=50000000
Maximum number of samples a single query can load into memory. Note that queries will fail if they try to load more samples than this into memory, so this also limits the number of
samples a query can return. Use with server mode only.
--enable-feature= ... Comma separated feature names to enable. Valid options: agent, exemplar-storage, expand-external-labels, memory-snapshot-on-shutdown, promql-at-modifier, promql-negative-offset,
promql-per-step-stats, remote-write-receiver (DEPRECATED), extra-scrape-metrics, new-service-discovery-manager, auto-gomaxprocs, no-default-scrape-port, native-histograms. See
https://prometheus.io/docs/prometheus/latest/feature_flags/ for more details.
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]

Some of these settings can also be defined in the Prometheus configuration file, which is simpler.

In the case of the values.yaml file used by the kube-prometheus-stack Helm chart, a single file configures all of the services at once.
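As a sketch of what that looks like, assuming the chart's prometheusSpec conventions (the key names follow the kube-prometheus-stack chart; the values are illustrative):

```yaml
prometheus:
  prometheusSpec:
    retention: 30d       # maps to --storage.tsdb.retention.time
    retentionSize: 50GB  # maps to --storage.tsdb.retention.size
```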

By default, collected samples are retained for 15 days, but this can be changed by passing flags when the service starts.

Some Prometheus settings are configured by passing flags, as shown above, while others come from the configuration file.

Flag-based settings are fixed for the lifetime of the process and are not re-read at runtime; changing them requires restarting the service. The configuration file, by contrast, can be reloaded while Prometheus is running.
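A running server can be told to re-read prometheus.yml without a restart. A sketch, assuming Prometheus listens on localhost:9090 and was started with --web.enable-lifecycle:

```shell
# POST to the lifecycle endpoint to reload prometheus.yml:
curl -X POST http://localhost:9090/-/reload

# Alternatively, send SIGHUP to the process:
kill -HUP "$(pidof prometheus)"
```

If the new file is invalid, the reload fails and Prometheus keeps running with the previous configuration.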

Flags

Flags are parameters passed when the binary starts, used to change settings that do not live in /etc/prometheus/prometheus.yml.

Some storage flags:

  • --storage.tsdb.path: Where Prometheus writes its database. Defaults to data/.
  • --storage.tsdb.retention.time: When to remove old data. Defaults to 15d. Overrides storage.tsdb.retention if this flag is set to anything other than the default.
  • --storage.tsdb.retention.size: The maximum number of bytes of storage blocks to retain. The oldest data is removed first. Defaults to 0, i.e. disabled. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: "512MB". Based on powers of 2, so 1KB is 1024B. Only persistent blocks are deleted to honor this retention, although the WAL and m-mapped chunks are counted in the total size. So the minimum disk requirement is the peak space taken by the wal (WAL and checkpoint) and chunks_head (m-mapped head chunks) directories combined (this peaks every 2 hours).
  • --storage.tsdb.wal-compression: Enables compression of the write-ahead log (WAL). Depending on your data, you can expect the WAL size to roughly halve with little extra CPU load. This flag is enabled by default.
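The retention.size note above suggests a capacity-planning rule of thumb from the Prometheus operational docs: needed disk space ≈ retention time × ingested samples per second × bytes per sample (roughly 1-2 bytes per sample after compression). A quick sketch with illustrative numbers (10k samples/s, 30 days):

```shell
# Rough TSDB disk estimate:
#   needed_bytes = retention_seconds * samples_per_second * bytes_per_sample
retention_seconds=$((30 * 24 * 3600))  # 30 days
samples_per_second=10000               # illustrative ingestion rate
bytes_per_sample=2                     # conservative post-compression figure
needed_bytes=$((retention_seconds * samples_per_second * bytes_per_sample))
echo "$needed_bytes"                   # about 52 GB
```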

Here is how the service startup would look in systemd, with the database path changed to /var/lib/prometheus/ and the retention time raised to 30 days (systemd does not allow inline comments after a line continuation, so the changes are noted here instead):

ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.external-url=http://34.89.26.156:9090 \
  --storage.tsdb.retention.time=30d
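After editing the unit file, systemd has to re-read it before the new flags take effect:

```shell
sudo systemctl daemon-reload
sudo systemctl restart prometheus
```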

If run manually, with only the TSDB settings changed:

prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --storage.tsdb.retention.time=30d
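Before restarting with an edited file, it is worth validating it first; promtool ships alongside the prometheus binary:

```shell
promtool check config /etc/prometheus/prometheus.yml
```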
The overall structure of prometheus.yml:

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

  # File to which PromQL queries are logged.
  # Reloading the configuration will reopen the file.
  [ query_log_file: <string> ]

# Specifies the paths of your rule files, relative to /etc/prometheus/.
rule_files:
  [ - <filepath_glob> ... ]

# Specifies a set of targets and parameters describing how to scrape them.
# In the general case, one scrape configuration specifies a single job.
# Targets may be statically configured via the static_configs parameter or
# dynamically discovered using one of the supported service discovery mechanisms.
scrape_configs:
  [ - <scrape_config> ... ]

# Settings related to Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]

# Storage related settings that are runtime reloadable.
storage:
  [ tsdb: <tsdb> ]
  [ exemplars: <exemplars> ]

# Configures exporting traces.
tracing:
  [ <tracing_config> ]
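Putting the skeleton above together, a minimal concrete prometheus.yml could look like this (target addresses and label values are illustrative):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: lab

rule_files:
  - rules/*.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
```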

scrape_config

A scrape_config holds a list of jobs, and each job can be configured differently: each one can have its own scrape interval, authentication method, service discovery mechanism, TLS settings, and so on.

The usual focus is Kubernetes, but the examples below illustrate the concepts and show some of what Prometheus is capable of.

scrape_configs:

  - job_name: prometheus
    honor_labels: true
    # scrape_interval is defined by the configured global (15s).
    # scrape_timeout is defined by the global default (10s).

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    file_sd_configs:
      - files:
          - foo/*.slow.json
          - foo/*.slow.yml
          - single/file.yml
        refresh_interval: 10m
      - files:
          - bar/*.yaml
    static_configs:
      - targets: ["localhost:9090", "localhost:9191"]
        labels:
          my: label
          your: label
    relabel_configs:
      - source_labels: [job, __meta_dns_name]
        regex: (.*)some-[regex]
        target_label: job
        replacement: foo-${1}
        # action defaults to 'replace'
      - source_labels: [abc]
        target_label: cde
      - replacement: static
        target_label: abc
      - regex:
        replacement: static
        target_label: abc
      - source_labels: [foo]
        target_label: abc
        action: keepequal
      - source_labels: [foo]
        target_label: abc
        action: dropequal
    authorization:
      credentials_file: valid_token_file
    tls_config:
      min_version: TLS10

  - job_name: service-x
    basic_auth:
      username: admin_name
      password: "multiline\nmysecret\ntest"
    scrape_interval: 50s
    scrape_timeout: 5s
    body_size_limit: 10MB
    sample_limit: 1000
    metrics_path: /my_path
    scheme: https
    dns_sd_configs:
      - refresh_interval: 15s
        names:
          - first.dns.address.domain.com
          - second.dns.address.domain.com
      - names:
          - first.dns.address.domain.com
    relabel_configs:
      - source_labels: [job]
        regex: (.*)some-[regex]
        action: drop
      - source_labels: [__address__]
        modulus: 8
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: 1
        action: keep
      - action: labelmap
        regex: 1
      - action: labeldrop
        regex: d
      - action: labelkeep
        regex: k
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: expensive_metric.*
        action: drop

  - job_name: service-y
    consul_sd_configs:
      - server: "localhost:1234"
        token: mysecret
        services: ["nginx", "cache", "mysql"]
        tags: ["canary", "v1"]
        node_meta:
          rack: "123"
        allow_stale: true
        scheme: https
        tls_config:
          ca_file: valid_ca_file
          cert_file: valid_cert_file
          key_file: valid_key_file
          insecure_skip_verify: false
    relabel_configs:
      - source_labels: [__meta_sd_consul_tags]
        separator: ","
        regex: label:([^=]+)=([^,]+)
        target_label: ${1}
        replacement: ${2}

  - job_name: service-z
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file
    authorization:
      credentials: mysecret

  - job_name: service-kubernetes
    kubernetes_sd_configs:
      - role: endpoints
        api_server: "https://localhost:1234"
        tls_config:
          cert_file: valid_cert_file
          key_file: valid_key_file
        basic_auth:
          username: "myusername"
          password: "mysecret"

  - job_name: service-kubernetes-namespaces
    kubernetes_sd_configs:
      - role: endpoints
        api_server: "https://localhost:1234"
        namespaces:
          names:
            - default
        basic_auth:
          username: "myusername"
          password_file: valid_password_file

  - job_name: service-kuma
    kuma_sd_configs:
      - server: http://kuma-control-plane.kuma-system.svc:5676

  - job_name: service-marathon
    marathon_sd_configs:
      - servers:
          - "https://marathon.example.com:443"
        auth_token: "mysecret"
        tls_config:
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: service-nomad
    nomad_sd_configs:
      - server: 'http://localhost:4646'

  - job_name: service-ec2
    ec2_sd_configs:
      - region: us-east-1
        access_key: access
        secret_key: mysecret
        profile: profile
        filters:
          - name: tag:environment
            values:
              - prod
          - name: tag:service
            values:
              - web
              - db

  - job_name: service-lightsail
    lightsail_sd_configs:
      - region: us-east-1
        access_key: access
        secret_key: mysecret
        profile: profile

  - job_name: service-azure
    azure_sd_configs:
      - environment: AzurePublicCloud
        authentication_method: OAuth
        subscription_id: 11AAAA11-A11A-111A-A111-1111A1111A11
        resource_group: my-resource-group
        tenant_id: BBBB222B-B2B2-2B22-B222-2BB2222BB2B2
        client_id: 333333CC-3C33-3333-CCC3-33C3CCCCC33C
        client_secret: mysecret
        port: 9100

  - job_name: service-nerve
    nerve_sd_configs:
      - servers:
          - localhost
        paths:
          - /monitoring

  - job_name: 0123service-xxx
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:9090

  - job_name: badfederation
    honor_timestamps: false
    metrics_path: /federate
    static_configs:
      - targets:
          - localhost:9090

  - job_name: 測試
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:9090

  - job_name: httpsd
    http_sd_configs:
      - url: "http://example.com/prometheus"

  - job_name: service-triton
    triton_sd_configs:
      - account: "testAccount"
        dns_suffix: "triton.example.com"
        endpoint: "triton.example.com"
        port: 9163
        refresh_interval: 1m
        version: 1
        tls_config:
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: digitalocean-droplets
    digitalocean_sd_configs:
      - authorization:
          credentials: abcdef

  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock

  - job_name: dockerswarm
    dockerswarm_sd_configs:
      - host: http://127.0.0.1:2375
        role: nodes

  - job_name: service-openstack
    openstack_sd_configs:
      - role: instance
        region: RegionOne
        port: 80
        refresh_interval: 1m
        tls_config:
          ca_file: valid_ca_file
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: service-puppetdb
    puppetdb_sd_configs:
      - url: https://puppetserver/
        query: 'resources { type = "Package" and title = "httpd" }'
        include_parameters: true
        port: 80
        refresh_interval: 1m
        tls_config:
          ca_file: valid_ca_file
          cert_file: valid_cert_file
          key_file: valid_key_file

  - job_name: hetzner
    relabel_configs:
      - action: uppercase
        source_labels: [instance]
        target_label: instance
    hetzner_sd_configs:
      - role: hcloud
        authorization:
          credentials: abcdef
      - role: robot
        basic_auth:
          username: abcdef
          password: abcdef

  - job_name: service-eureka
    eureka_sd_configs:
      - server: "http://eureka.example.com:8761/eureka"

  - job_name: ovhcloud
    ovhcloud_sd_configs:
      - service: vps
        endpoint: ovh-eu
        application_key: testAppKey
        application_secret: testAppSecret
        consumer_key: testConsumerKey
        refresh_interval: 1m
      - service: dedicated_server
        endpoint: ovh-eu
        application_key: testAppKey
        application_secret: testAppSecret
        consumer_key: testConsumerKey
        refresh_interval: 1m

  - job_name: scaleway
    scaleway_sd_configs:
      - role: instance
        project_id: 11111111-1111-1111-1111-111111111112
        access_key: SCWXXXXXXXXXXXXXXXXX
        secret_key: 11111111-1111-1111-1111-111111111111
      - role: baremetal
        project_id: 11111111-1111-1111-1111-111111111112
        access_key: SCWXXXXXXXXXXXXXXXXX
        secret_key: 11111111-1111-1111-1111-111111111111

  - job_name: linode-instances
    linode_sd_configs:
      - authorization:
          credentials: abcdef

  - job_name: uyuni
    uyuni_sd_configs:
      - server: https://localhost:1234
        username: gopher
        password: hole

  - job_name: ionos
    ionos_sd_configs:
      - datacenter_id: 8feda53f-15f0-447f-badf-ebe32dad2fc0
        authorization:
          credentials: abcdef

  - job_name: vultr
    vultr_sd_configs:
      - authorization:
          credentials: abcdef

tracing

tracing:
  endpoint: "localhost:4317"
  client_type: "grpc"
  headers:
    foo: "bar"
  timeout: 5s
  compression: "gzip"
  tls_config:
    cert_file: valid_cert_file
    key_file: valid_key_file
    insecure_skip_verify: true

storage

storage:
  tsdb:
    # The TSDB path, retention time/size and WAL compression are set via
    # the --storage.tsdb.* flags, not in the configuration file.
    out_of_order_time_window: 30m


remote_write:
  - url: http://remote1/push
    name: drop_expensive
    write_relabel_configs:
      - source_labels: [__name__]
        regex: expensive.*
        action: drop
    oauth2:
      client_id: "123"
      client_secret: "456"
      token_url: "http://remote1/auth"
      tls_config:
        cert_file: valid_cert_file
        key_file: valid_key_file

  - url: http://remote2/push
    name: rw_tls
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file
    headers:
      name: value

remote_read:
  - url: http://remote1/read
    read_recent: true
    name: default
    enable_http2: false
  - url: http://remote3/read
    read_recent: false
    name: read_special
    required_matchers:
      job: special
    tls_config:
      cert_file: valid_cert_file
      key_file: valid_key_file