How I Deployed Monitoring

10.05.23


An article about my experience deploying a monitoring system and collecting and aggregating the 1C technological log.

Source code

monitoring.zip (.zip, 40.72 MB)

Hello everyone

My name is Andrey. I'm a 1C programmer working in the IT department of a large commercial company. I have about eight years of experience in 1C development. I love hard problems, memes and anything new.

How it all started

In my company, the main accounting system is a fairly old version of UT 11 that has not been updated for many years and in which almost nothing is left of the vendor's code. I think it's no secret that performance problems are common in this situation, because 90% of the time goes into building features that were needed yesterday.

2023 turned out to be an exception to the rule, and optimization became one of the key priorities for the development department. The main goal is to reduce the execution time of key operations, in particular the time it takes to post a customer order.

Over the first three months we did quite a lot of work and got good results, but we were still far from where we wanted to be. We also noticed that posting time fluctuated a lot during the day. We could not find any pattern, and we had no data at all on the load on the system, the hardware, the network and so on. At that point the head of the department had the idea of setting up monitoring, both to try to find the causes of these spikes and to have at least some picture of how the system behaves.

The task fell to me and went roughly like this: "We need to set up monitoring of the main system so that we can see where our problems are. The video 'Оптимизация запросов в 7 ТБ базе 1С' (Query optimization in a 7 TB 1C database) showed beautiful dashboards with graphs; it would be great if we had the same."

The stack and the first version of the project

After rewatching the first part of the video, I saw the coveted graphs. It all looked really cool. It was the fairly well-known Grafana, which I had heard a little about but had never worked with. But the task had been set, and it had to be done.

After reading a few short articles on Infostart, I understood which direction to move in and settled on an initial technology stack:

1. Grafana - the main tool for data visualization.

2. Prometheus - a time series database for collecting and storing metrics.

3. Docker - a containerization tool, so that all of this starts with one magic command:

docker-compose up -d

A small Debian server was set up for me, I created a project folder and got to work. I won't describe all the hundreds of trials and errors; I'll just say that after studying a pile of official documentation I ended up with a first version of the project that started and worked. Prometheus was configured to collect metrics only from itself, and by opening Grafana you could already get a feel for the tool and build a few graphs.
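
To confirm that the self-scrape is actually producing data, one option is to ask Prometheus for the built-in `up` metric through its HTTP API. The sketch below is only an illustration: it assumes Prometheus listens on its default port 9090 on the same host, and the function names are hypothetical, not part of the project archive.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus listens on its default port 9090 on this host.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_up(payload: dict) -> dict:
    """Map each scraped instance to True/False from the result of the `up` query."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {
        # Sample values arrive as [timestamp, "string value"]; "1" means the scrape succeeded.
        sample["metric"].get("instance", "?"): sample["value"][1] == "1"
        for sample in payload["data"]["result"]
    }


def fetch_up(base_url: str = PROMETHEUS_URL) -> dict:
    """Ask a running Prometheus which of its targets are up (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url, "up"), timeout=5) as resp:
        return parse_up(json.load(resp))
```

Calling `fetch_up()` on the monitoring server should return one entry per scrape target; with Prometheus scraping only itself, that would be a single instance.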

First version of the project

Project structure:

Docker:

docker-compose.yml

I don't think this file needs an introduction. Most of the lines are copied from the official documentation.

version: '3.8'
services:
  # Grafana
  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    restart: unless-stopped
    user: root
    volumes:
      # Grafana service configuration files
      - ./grafana/conf:/etc/grafana
      # Grafana data storage
      - ./grafana/grafana_data:/var/lib/grafana
    network_mode: "host"
  # Prometheus
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    restart: unless-stopped
    user: root
    volumes:
      # Prometheus service configuration files
      - ./prometheus/conf:/etc/prometheus
      # Prometheus database files
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    network_mode: "host"
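
Because both services in this compose file use network_mode: "host", they listen directly on the server's own ports, so a quick check after startup can simply poll their health endpoints. The following is a minimal sketch under assumptions: port 3000 comes from http_port in grafana.ini, port 9090 is the Prometheus default, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed endpoints: with network_mode "host" both services bind to host ports.
# 3000 is http_port from grafana.ini; 9090 is the Prometheus default.
HEALTH_URLS = {
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
}


def is_healthy(url: str, opener=urllib.request.urlopen) -> bool:
    """Return True if the URL answers with HTTP 200 within 5 seconds."""
    try:
        with opener(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer all count as unhealthy.
        return False


def check_all(urls=HEALTH_URLS, opener=urllib.request.urlopen) -> dict:
    """Check every service and return a {name: healthy} mapping."""
    return {name: is_healthy(url, opener) for name, url in urls.items()}
```

The `opener` parameter exists only so the check can be exercised without a running stack; on the server itself, `check_all()` with the defaults is enough.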

Grafana:

grafana/conf/grafana.ini

The main Grafana configuration file. It is the standard file copied from the official documentation; nothing in it was changed.
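
Since the file is left at its defaults, it may help to know that Grafana also accepts per-setting overrides through environment variables named GF_<SectionName>_<KeyName>, with the section and key upper-cased and any dots or dashes replaced by underscores. The helper below is a hypothetical illustration of that naming rule only; it is not something the project uses.

```python
import re


def grafana_env_name(section: str, key: str) -> str:
    """Return the GF_<SECTION>_<KEY> variable name that overrides an ini setting."""

    def norm(part: str) -> str:
        # Grafana's convention: upper-case, with '.' and '-' replaced by '_'.
        return re.sub(r"[.-]", "_", part).upper()

    return f"GF_{norm(section)}_{norm(key)}"
```

For example, the admin_password key from the [security] section maps to GF_SECURITY_ADMIN_PASSWORD, which could be set in the `environment:` section of the grafana service instead of editing the ini file.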

##################### Grafana Configuration Defaults #####################
#
# Do not modify this file in grafana installs
#
# possible values : production, development
app_mode = production
# instance name, defaults to HOSTNAME environment variable value or hostname if HOSTNAME var is empty
instance_name = ${HOSTNAME}
# force migration will run migrations that might cause dataloss
force_migration = false
#################################### Paths ###############################
[paths]
# Path to where grafana can store temp files, sessions, and the sqlite3 db (if that is used)
data = data
# Temporary files in `data` directory older than given duration will be removed
temp_data_lifetime = 24h
# Directory where grafana can store logs
logs = data/log
# Directory where grafana will automatically scan and look for plugins
plugins = data/plugins
# folder that contains provisioning config files that grafana will apply on startup and while running.
provisioning = conf/provisioning
#################################### Server ##############################
[server]
# Protocol (http, https, h2, socket)
protocol = http
# The ip address to bind to, empty will bind to all interfaces
http_addr =
# The http port to use
http_port = 3000
# The public facing domain name used to access grafana from a browser
domain = localhost
# Redirect to correct domain if host header does not match domain
# Prevents DNS rebinding attacks
enforce_domain = false
# The full public facing url
root_url = %(protocol)s://%(domain)s:%(http_port)s/
# Serve Grafana from subpath specified in `root_url` setting. By default it is set to `false` for compatibility reasons.
serve_from_sub_path = false
# Log web requests
router_logging = false
# the path relative working path
static_root_path = public
# enable gzip
enable_gzip = false
# https certs & key file
cert_file =
cert_key =
# Unix socket gid
# Changing the gid of a file without privileges requires that the target group is in the group of the process and that the process is the file owner
# It is recommended to set the gid as http server user gid
# Not set when the value is -1
socket_gid = -1
# Unix socket mode
socket_mode = 0660
# Unix socket path
socket = /tmp/grafana.sock
# CDN Url
cdn_url =
# Sets the maximum time in minutes before timing out read of an incoming request and closing idle connections.
# `0` means there is no timeout for reading the request.
read_timeout = 0
# This setting enables you to specify additional headers that the server adds to HTTP(S) responses.
[server.custom_response_headers]
#exampleHeader1 = exampleValue1
#exampleHeader2 = exampleValue2
#################################### Database ############################
[database]
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as on string using the url property.
# Either "mysql", "postgres" or "sqlite3", it's your choice
type = sqlite3
host = 127.0.0.1:3306
name = grafana
user = root
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password =
# Use either URL or the previous fields to configure the database
# Example: mysql://user:secret@host:port/database
url =
# Max idle conn setting default is 2
max_idle_conn = 2
# Max conn setting default is 0 (mean not set)
max_open_conn =
# Connection Max Lifetime default is 14400 (means 14400 seconds or 4 hours)
conn_max_lifetime = 14400
# Set to true to log the sql calls and execution times.
log_queries =
# For "postgres", use either "disable", "require" or "verify-full"
# For "mysql", use either "true", "false", or "skip-verify".
ssl_mode = disable
# Database drivers may support different transaction isolation levels.
# Currently, only "mysql" driver supports isolation levels.
# If the value is empty - driver's default isolation level is applied.
# For "mysql" use "READ-UNCOMMITTED", "READ-COMMITTED", "REPEATABLE-READ" or "SERIALIZABLE".
isolation_level =
ca_cert_path =
client_key_path =
client_cert_path =
server_cert_name =
# For "sqlite3" only, path relative to data_path setting
path = grafana.db
# For "sqlite3" only. cache mode setting used for connecting to the database
cache_mode = private
# For "sqlite3" only. Enable/disable Write-Ahead Logging, https://sqlite.org/wal.html. Default is false.
wal = false
# For "mysql" only if migrationLocking feature toggle is set. How many seconds to wait before failing to lock the database for the migrations, default is 0.
locking_attempt_timeout_sec = 0
# For "sqlite" only. How many times to retry query in case of database is locked failures. Default is 0 (disabled).
query_retries = 0
# For "sqlite" only. How many times to retry transaction in case of database is locked failures. Default is 5.
transaction_retries = 5
#################################### Cache server #############################
[remote_cache]
# Either "redis", "memcached" or "database" default is "database"
type = database
# cache connectionstring options
# database: will use Grafana primary database.
# redis: config like redis server e.g. `addr=127.0.0.1:6379,pool_size=100,db=0,ssl=false`. Only addr is required. ssl may be 'true', 'false', or 'insecure'.
# memcache: 127.0.0.1:11211
connstr =
# prefix prepended to all the keys in the remote cache
prefix =
# This enables encryption of values stored in the remote cache
encryption =
#################################### Data proxy ###########################
[dataproxy]
# This enables data proxy logging, default is false
logging = false
# How long the data proxy waits to read the headers of the response before timing out, default is 30 seconds.
# This setting also applies to core backend HTTP data sources where query requests use an HTTP client with timeout set.
timeout = 30
# How long the data proxy waits to establish a TCP connection before timing out, default is 10 seconds.
dialTimeout = 10
# How many seconds the data proxy waits before sending a keepalive request.
keep_alive_seconds = 30
# How many seconds the data proxy waits for a successful TLS Handshake before timing out.
tls_handshake_timeout_seconds = 10
# How many seconds the data proxy will wait for a server's first response headers after
# fully writing the request headers if the request has an "Expect: 100-continue"
# header. A value of 0 will result in the body being sent immediately, without
# waiting for the server to approve.
expect_continue_timeout_seconds = 1
# Optionally limits the total number of connections per host, including connections in the dialing,
# active, and idle states. On limit violation, dials will block.
# A value of zero (0) means no limit.
max_conns_per_host = 0
# The maximum number of idle connections that Grafana will keep alive.
max_idle_connections = 100
# How many seconds the data proxy keeps an idle connection open before timing out.
idle_conn_timeout_seconds = 90
# If enabled and user is not anonymous, data proxy will add X-Grafana-User header with username into the request.
send_user_header = false
# Limit the amount of bytes that will be read/accepted from responses of outgoing HTTP requests.
response_limit = 0
# Limits the number of rows that Grafana will process from SQL data sources.
row_limit = 1000000
#################################### Analytics ###########################
[analytics]
# Server reporting, sends usage counters to stats.grafana.org every 24 hours.
# No ip addresses are being tracked, only simple counters to track
# running instances, dashboard and error counts. It is very helpful to us.
# Change this option to false to disable reporting.
reporting_enabled = true
# The name of the distributor of the Grafana instance. Ex hosted-grafana, grafana-labs
reporting_distributor = grafana-labs
# Set to false to disable all checks to https://grafana.com
# for new versions of grafana. The check is used
# in some UI views to notify that a grafana update exists.
# This option does not cause any auto updates, nor send any information
# only a GET request to https://raw.githubusercontent.com/grafana/grafana/main/latest.json to get the latest version.
check_for_updates = true
# Set to false to disable all checks to https://grafana.com
# for new versions of plugins. The check is used
# in some UI views to notify that a plugin update exists.
# This option does not cause any auto updates, nor send any information
# only a GET request to https://grafana.com to get the latest versions.
check_for_plugin_updates = true
# Google Analytics universal tracking code, only enabled if you specify an id here
google_analytics_ua_id =
# Google Analytics 4 tracking code, only enabled if you specify an id here
google_analytics_4_id =
# When Google Analytics 4 Enhanced event measurement is enabled, we will try to avoid sending duplicate events and let Google Analytics 4 detect navigation changes, etc.
google_analytics_4_send_manual_page_views = false
# Google Tag Manager ID, only enabled if you specify an id here
google_tag_manager_id =
# Rudderstack write key, enabled only if rudderstack_data_plane_url is also set
rudderstack_write_key =
# Rudderstack data plane url, enabled only if rudderstack_write_key is also set
rudderstack_data_plane_url =
# Rudderstack SDK url, optional, only valid if rudderstack_write_key and rudderstack_data_plane_url is also set
rudderstack_sdk_url =
# Rudderstack Config url, optional, used by Rudderstack SDK to fetch source config
rudderstack_config_url =
# Application Insights connection string. Specify an URL string to enable this feature.
application_insights_connection_string =
# Optional. Specifies an Application Insights endpoint URL where the endpoint string is wrapped in backticks ``.
application_insights_endpoint_url =
# Controls if the UI contains any links to user feedback forms
feedback_links_enabled = true
#################################### Security ############################
[security]
# disable creation of admin user on first start of grafana
disable_initial_admin_creation = false
# default admin user, created on startup
admin_user = admin
# default admin password, can be changed before first start of grafana, or in profile settings
admin_password = admin
# default admin email, created on startup
admin_email = admin@localhost
# used for signing
secret_key = SW2YcwTIb9zpOOhoPsMm
# current key provider used for envelope encryption, default to static value specified by secret_key
encryption_provider = secretKey.v1
# list of configured key providers, space separated (Enterprise only): e.g., awskms.v1 azurekv.v1
available_encryption_providers =
# disable gravatar profile images
disable_gravatar = false
# data source proxy whitelist (ip_or_domain:port separated by spaces)
data_source_proxy_whitelist =
# disable protection against brute force login attempts
disable_brute_force_login_protection = false
# set to true if you host Grafana behind HTTPS. default is false.
cookie_secure = false
# set cookie SameSite attribute. defaults to `lax`. can be set to "lax", "strict", "none" and "disabled"
cookie_samesite = lax
# set to true if you want to allow browsers to render Grafana in a <frame>, <iframe>, <embed> or <object>. default is false.
allow_embedding = false
# Set to true if you want to enable http strict transport security (HSTS) response header.
# HSTS tells browsers that the site should only be accessed using HTTPS.
strict_transport_security = false
# Sets how long a browser should cache HSTS. Only applied if strict_transport_security is enabled.
strict_transport_security_max_age_seconds = 86400
# Set to true if to enable HSTS preloading option. Only applied if strict_transport_security is enabled.
strict_transport_security_preload = false
# Set to true if to enable the HSTS includeSubDomains option. Only applied if strict_transport_security is enabled.
strict_transport_security_subdomains = false
# Set to true to enable the X-Content-Type-Options response header.
# The X-Content-Type-Options response HTTP header is a marker used by the server to indicate that the MIME types advertised
# in the Content-Type headers should not be changed and be followed.
x_content_type_options = true
# Set to true to enable the X-XSS-Protection header, which tells browsers to stop pages from loading
# when they detect reflected cross-site scripting (XSS) attacks.
x_xss_protection = true
# Enable adding the Content-Security-Policy header to your requests.
# CSP allows to control resources the user agent is allowed to load and helps prevent XSS attacks.
content_security_policy = false
# Set Content Security Policy template used when adding the Content-Security-Policy header to your requests.
# $NONCE in the template includes a random nonce.
# $ROOT_PATH is server.root_url without the protocol.
content_security_policy_template = """script-src 'self' 'unsafe-eval' 'unsafe-inline' 'strict-dynamic' $NONCE;object-src 'none';font-src 'self';style-src 'self' 'unsafe-inline' blob:;img-src * data:;base-uri 'self';connect-src 'self' grafana.com ws://$ROOT_PATH wss://$ROOT_PATH;manifest-src 'self';media-src 'none';form-action 'self';"""
# Enable adding the Content-Security-Policy-Report-Only header to your requests.
# Allows you to monitor the effects of a policy without enforcing it.
content_security_policy_report_only = false
# Set Content Security Policy Report Only template used when adding the Content-Security-Policy-Report-Only header to your requests.
# $NONCE in the template includes a random nonce.
# $ROOT_PATH is server.root_url without the protocol.
content_security_policy_report_only_template = """script-src 'self' 'unsafe-eval' 'unsafe-inline' 'strict-dynamic' $NONCE;object-src 'none';font-src 'self';style-src 'self' 'unsafe-inline' blob:;img-src * data:;base-uri 'self';connect-src 'self' grafana.com ws://$ROOT_PATH wss://$ROOT_PATH;manifest-src 'self';media-src 'none';form-action 'self';"""
# Controls if old angular plugins are supported or not. This will be disabled by default in future release
angular_support_enabled = true
[security.encryption]
# Defines the time-to-live (TTL) for decrypted data encryption keys stored in memory (cache).
# Please note that small values may cause performance issues due to a high frequency decryption operations.
data_keys_cache_ttl = 15m
# Defines the frequency of data encryption keys cache cleanup interval.
# On every interval, decrypted data encryption keys that reached the TTL are removed from the cache.
data_keys_cache_cleanup_interval = 1m
#################################### Snapshots ###########################
[snapshots]
# set to false to remove snapshot functionality
enabled = true
# snapshot sharing options
external_enabled = true
external_snapshot_url = https://snapshots.raintank.io
external_snapshot_name = Publish to snapshots.raintank.io
# Set to true to enable this Grafana instance act as an external snapshot server and allow unauthenticated requests for
# creating and deleting snapshots.
public_mode = false
# remove expired snapshot
snapshot_remove_expired = true
#################################### Dashboards ##################
[dashboards]
# Number dashboard versions to keep (per dashboard). Default: 20, Minimum: 1
versions_to_keep = 20
# Minimum dashboard refresh interval. When set, this will restrict users to set the refresh interval of a dashboard lower than given interval. Per default this is 5 seconds.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
min_refresh_interval = 5s
# Path to the default home dashboard. If this value is empty, then Grafana uses StaticRootPath + "dashboards/home.json"
default_home_dashboard_path =
################################### Data sources #########################
[datasources]
# Upper limit of data sources that Grafana will return. This limit is a temporary configuration and it will be deprecated when pagination will be introduced on the list data sources API.
datasource_limit = 5000
#################################### Users ###############################
[users]
# disable user signup / registration
allow_sign_up = false
# Allow non admin users to create organizations
allow_org_create = false
# Set to true to automatically assign new users to the default organization (id 1)
auto_assign_org = true
# Set this value to automatically add new users to the provided organization (if auto_assign_org above is set to true)
auto_assign_org_id = 1
# Default role new users will be automatically assigned (if auto_assign_org above is set to true)
auto_assign_org_role = Viewer
# Require email validation before sign up completes
verify_email_enabled = false
# Background text for the user field on the login page
login_hint = email or username
password_hint = password
# Default UI theme ("dark" or "light" or "system")
default_theme = dark
# Default UI language (supported IETF language tag, such as en-US)
default_language = en-US
# Path to a custom home page. Users are only redirected to this if the default home dashboard is used. It should match a frontend route and contain a leading slash.
home_page =
# External user management
external_manage_link_url =
external_manage_link_name =
external_manage_info =
# Viewers can edit/inspect dashboard settings in the browser. But not save the dashboard.
viewers_can_edit = false
# Editors can administrate dashboard, folders and teams they create
editors_can_admin = false
# The duration in time a user invitation remains valid before expiring. This setting should be expressed as a duration. Examples: 6h (hours), 2d (days), 1w (week). Default is 24h (24 hours). The minimum supported duration is 15m (15 minutes).
user_invite_max_lifetime_duration = 24h
# Enter a comma-separated list of usernames to hide them in the Grafana UI. These users are shown to Grafana admins and to themselves.
hidden_users =
[service_accounts]
# When set, Grafana will not allow the creation of tokens with expiry greater than this setting.
token_expiration_day_limit =
[auth]
# Login cookie name
login_cookie_name = grafana_session
# Disable usage of Grafana build-in login solution.
disable_login = false
# The maximum lifetime (duration) an authenticated user can be inactive before being required to login at next visit. Default is 7 days (7d). This setting should be expressed as a duration, e.g. 5m (minutes), 6h (hours), 10d (days), 2w (weeks), 1M (month). The lifetime resets at each successful token rotation (token_rotation_interval_minutes).
login_maximum_inactive_lifetime_duration =
# The maximum lifetime (duration) an authenticated user can be logged in since login time before being required to login. Default is 30 days (30d). This setting should be expressed as a duration, e.g. 5m (minutes), 6h (hours), 10d (days), 2w (weeks), 1M (month).
login_maximum_lifetime_duration =
# How often should auth tokens be rotated for authenticated users when being active. The default is each 10 minutes.
token_rotation_interval_minutes = 10
# Set to true to disable (hide) the login form, useful if you use OAuth
disable_login_form = false
# Set to true to disable the sign out link in the side menu. Useful if you use auth.proxy or auth.jwt.
disable_signout_menu = false
# URL to redirect the user to after sign out
signout_redirect_url =
# Set to true to attempt login with OAuth automatically, skipping the login screen.
# This setting is ignored if multiple OAuth providers are configured.
# Deprecated, use auto_login option for specific provider instead.
oauth_auto_login = false
# OAuth state max age cookie duration in seconds. Defaults to 600 seconds.
oauth_state_cookie_max_age = 600
# Skip forced assignment of OrgID 1 or 'auto_assign_org_id' for social logins
oauth_skip_org_role_update_sync = false
# limit of api_key seconds to live before expiration
api_key_max_seconds_to_live = -1
# Set to true to enable SigV4 authentication option for HTTP-based datasources
sigv4_auth_enabled = false
# Set to true to enable verbose logging of SigV4 request signing
sigv4_verbose_logging = false
# Set to true to enable Azure authentication option for HTTP-based datasources
azure_auth_enabled = false
#################################### Anonymous Auth ######################
[auth.anonymous]
# enable anonymous access
enabled = false
# specify organization name that should be used for unauthenticated users
org_name = Main Org.
# specify role for unauthenticated users
org_role = Viewer
# mask the Grafana version number for unauthenticated users
hide_version = false
#################################### GitHub Auth #########################
[auth.github]
enabled = false
allow_sign_up = true
auto_login = false
client_id = some_id
client_secret =
scopes = user:email,read:org
auth_url = https://github.com/login/oauth/authorize
token_url = https://github.com/login/oauth/access_token
api_url = https://api.github.com/user
allowed_domains =
team_ids =
allowed_organizations =
role_attribute_path =
role_attribute_strict = false
allow_assign_grafana_admin = false
#################################### GitLab Auth #########################
[auth.gitlab]
enabled = false
allow_sign_up = true
auto_login = false
client_id = some_id
client_secret =
scopes = api
auth_url = https://gitlab.com/oauth/authorize
token_url = https://gitlab.com/oauth/token
api_url = https://gitlab.com/api/v4
allowed_domains =
allowed_groups =
role_attribute_path =
role_attribute_strict = false
allow_assign_grafana_admin = false
skip_org_role_sync = false
#################################### Google Auth #########################
[auth.google]
enabled = false
allow_sign_up = true
auto_login = false
client_id = some_client_id
client_secret =
scopes = https://www.googleapis.com/auth/userinfo.profile https://www.googleapis.com/auth/userinfo.email
auth_url = https://accounts.google.com/o/oauth2/auth
token_url = https://accounts.google.com/o/oauth2/token
api_url = https://www.googleapis.com/oauth2/v1/userinfo
allowed_domains =
hosted_domain =
skip_org_role_sync = false
#################################### Grafana.com Auth ####################
# legacy key names (so they work in env variables)
[auth.grafananet]
enabled = false
allow_sign_up = true
client_id = some_id
client_secret =
scopes = user:email
allowed_organizations =
[auth.grafana_com]
enabled = false
allow_sign_up = true
auto_login = false
client_id = some_id
client_secret =
scopes = user:email
allowed_organizations =
skip_org_role_sync = false
#################################### Azure AD OAuth #######################
[auth.azuread]
name = Azure AD
enabled = false
allow_sign_up = true
auto_login = false
client_id = some_client_id
client_secret =
scopes = openid email profile
auth_url = https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/authorize
token_url = https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token
allowed_domains =
allowed_groups =
role_attribute_strict = false
allow_assign_grafana_admin = false
force_use_graph_api = false
#################################### Okta OAuth #######################
[auth.okta]
name = Okta
icon = okta
enabled = false
allow_sign_up = true
auto_login = false
client_id = some_id
client_secret =
scopes = openid profile email groups
auth_url = https://<tenant-id>.okta.com/oauth2/v1/authorize
token_url = https://<tenant-id>.okta.com/oauth2/v1/token
api_url = https://<tenant-id>.okta.com/oauth2/v1/userinfo
allowed_domains =
allowed_groups =
role_attribute_path =
role_attribute_strict = false
allow_assign_grafana_admin = false
skip_org_role_sync = false
#################################### Generic OAuth #######################
[auth.generic_oauth]
name = OAuth
icon = signin
enabled = false
allow_sign_up = true
auto_login = false
client_id = some_id
client_secret =
scopes = user:email
empty_scopes = false
email_attribute_name = email:primary
email_attribute_path =
login_attribute_path =
name_attribute_path =
role_attribute_path =
role_attribute_strict = false
groups_attribute_path =
id_token_attribute_name =
team_ids_attribute_path =
auth_url =
token_url =
api_url =
teams_url =
allowed_domains =
team_ids =
allowed_organizations =
tls_skip_verify_insecure = false
tls_client_cert =
tls_client_key =
tls_client_ca =
use_pkce = false
auth_style =
allow_assign_grafana_admin = false
#################################### Basic Auth ##########################
[auth.basic]
enabled = true
#################################### Auth Proxy ##########################
[auth.proxy]
enabled = false
header_name = X-WEBAUTH-USER
header_property = username
auto_sign_up = true
sync_ttl = 60
whitelist =
headers =
headers_encoded = false
enable_login_token = false
#################################### Auth JWT ##########################
[auth.jwt]
enabled = false
enable_login_token = false
header_name =
email_claim =
username_claim =
jwk_set_url =
jwk_set_file =
cache_ttl = 60m
expect_claims = {}
key_file =
role_attribute_path =
role_attribute_strict = false
auto_sign_up = false
url_login = false
allow_assign_grafana_admin = false
skip_org_role_sync = false
#################################### Auth LDAP ###########################
[auth.ldap]
enabled = false
config_file = /etc/grafana/ldap.toml
allow_sign_up = true
skip_org_role_sync = false
# LDAP background sync (Enterprise only)
# At 1 am every day
sync_cron = "0 1 * * *"
active_sync_enabled = true
#################################### AWS ###########################
[aws]
# Enter a comma-separated list of allowed AWS authentication providers.
# Options are: default (AWS SDK Default), keys (Access && secret key), credentials (Credentials field), ec2_iam_role (EC2 IAM Role)
allowed_auth_providers = default,keys,credentials
# Allow AWS users to assume a role using temporary security credentials.
# If true, assume role will be enabled for all AWS authentication providers that are specified in aws_auth_providers
assume_role_enabled = true
# Specify max no of pages to be returned by the ListMetricPages API
list_metrics_page_limit = 500
#################################### Azure ###############################
[azure]
# Azure cloud environment where Grafana is hosted
# Possible values are AzureCloud, AzureChinaCloud, AzureUSGovernment and AzureGermanCloud
# Default value is AzureCloud (i.e. public cloud)
cloud = AzureCloud
# Specifies whether Grafana hosted in Azure service with Managed Identity configured (e.g. Azure Virtual Machines instance)
# If enabled, the managed identity can be used for authentication of Grafana in Azure services
# Disabled by default, needs to be explicitly enabled
managed_identity_enabled = false
# Client ID to use for user-assigned managed identity
# Should be set for user-assigned identity and should be empty for system-assigned identity
managed_identity_client_id =
#################################### Role-based Access Control ###########
[rbac]
# If enabled, cache permissions in a in memory cache
permission_cache = true
# Reset basic roles permissions on boot
# Warning left to true, basic roles permissions will be reset on every boot
reset_basic_roles = false
#################################### SMTP / Emailing #####################
[smtp]
enabled = false
host = localhost:25
user =
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password =
cert_file =
key_file =
skip_verify = false
from_address = admin@grafana.localhost
from_name = Grafana
ehlo_identity =
startTLS_policy =
[emails]
welcome_email_on_sign_up = false
templates_pattern = emails/*.html, emails/*.txt
content_types = text/html
#################################### Logging ##########################
[log]
# Either "console", "file", "syslog". Default is console and file
# Use space to separate multiple modes, e.g. "console file"
mode = console file
# Either "debug", "info", "warn", "error", "critical", default is "info"
level = info
# optional settings to set different levels for specific loggers. Ex filters = sqlstore:debug
filters =
# For "console" mode only
[log.console]
level =
# log line format, valid options are text, console and json
format = console
# For "file" mode only
[log.file]
level =
# log line format, valid options are text, console and json
format = text
# This enables automated log rotate(switch of following options), default is true
log_rotate = true
# Max line number of single file, default is 1000000
max_lines = 1000000
# Max size shift of single file, default is 28 means 1 << 28, 256MB
max_size_shift = 28
# Segment log daily, default is true
daily_rotate = true
# Expired days of log file(delete after max days), default is 7
max_days = 7
[log.syslog]
level =
# log line format, valid options are text, console and json
format = text
# Syslog network type and address. This can be udp, tcp, or unix. If left blank, the default unix endpoints will be used.
network =
address =
# Syslog facility. user, daemon and local0 through local7 are valid.
facility =
# Syslog tag. By default, the process' argv[0] is used.
tag =
[log.frontend]
# Should Sentry javascript agent be initialized
enabled = false
# Defines which provider to use sentry or grafana
provider = sentry
# Sentry DSN if you want to send events to Sentry.
sentry_dsn =
# Custom HTTP endpoint to send events to. Default will log the events to stdout.
custom_endpoint =
# Rate of events to be reported to Sentry between 0 (none) and 1 (all), float
sample_rate = 1.0
# Requests per second limit enforced per an extended period, for Grafana backend log ingestion endpoint (/log).
log_endpoint_requests_per_second_limit = 3
# Max requests accepted per short interval of time for Grafana backend log ingestion endpoint (/log)
log_endpoint_burst_limit = 15
# Should error instrumentation be enabled, only affects Grafana Javascript Agent
instrumentations_errors_enabled = true
# Should console instrumentation be enabled, only affects Grafana Javascript Agent
instrumentations_console_enabled = false
# Should webvitals instrumentation be enabled, only affects Grafana Javascript Agent
instrumentations_webvitals_enabled = false
# Api Key, only applies to Grafana Javascript Agent provider
api_key =
#################################### Usage Quotas ########################
[quota]
enabled = false
#### set quotas to -1 to make unlimited. ####
# limit number of users per Org.
org_user = 10
# limit number of dashboards per Org.
org_dashboard = 100
# limit number of data_sources per Org.
org_data_source = 10
# limit number of api_keys per Org.
org_api_key = 10
# limit number of alerts per Org.
org_alert_rule = 100
# limit number of orgs a user can create.
user_org = 10
# Global limit of users.
global_user = -1
# global limit of orgs.
global_org = -1
# global limit of dashboards
global_dashboard = -1
# global limit of api_keys
global_api_key = -1
# global limit on number of logged in users.
global_session = -1
# global limit of alerts
global_alert_rule = -1
# global limit of files uploaded to the SQL DB
global_file = 1000
#################################### Unified Alerting ####################
[unified_alerting]
# Enable the Unified Alerting sub-system and interface. When enabled we'll migrate all of your alert rules and notification channels to the new system. New alert rules will be created and your notification channels will be converted into an Alertmanager configuration. Previous data is preserved to enable backwards compatibility but new data is removed when switching. When this configuration section and flag are not defined, the state is defined at runtime. See the documentation for more details.
enabled =
# Comma-separated list of organization IDs for which to disable unified alerting. Only supported if unified alerting is enabled.
disabled_orgs =
# Specify the frequency of polling for admin config changes.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
admin_config_poll_interval = 60s
# Specify the frequency of polling for Alertmanager config changes.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
alertmanager_config_poll_interval = 60s
# Listen address/hostname and port to receive unified alerting messages for other Grafana instances. The port is used for both TCP and UDP. It is assumed other Grafana instances are also running on the same port.
ha_listen_address = "0.0.0.0:9094"
# Explicit address/hostname and port to advertise other Grafana instances. The port is used for both TCP and UDP.
ha_advertise_address = ""
# Comma-separated list of initial instances (in a format of host:port) that will form the HA cluster. Configuring this setting will enable High Availability mode for alerting.
ha_peers = ""
# Time to wait for an instance to send a notification via the Alertmanager. In HA, each Grafana instance will
# be assigned a position (e.g. 0, 1). We then multiply this position with the timeout to indicate how long should
# each instance wait before sending the notification to take into account replication lag.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
ha_peer_timeout = 15s
# The interval between sending gossip messages. By lowering this value (more frequent) gossip messages are propagated
# across cluster more quickly at the expense of increased bandwidth usage.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
ha_gossip_interval = 200ms
# The interval between gossip full state syncs. Setting this interval lower (more frequent) will increase convergence speeds
# across larger clusters at the expense of increased bandwidth usage.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
ha_push_pull_interval = 60s
# Enable or disable alerting rule execution. The alerting UI remains visible. This option has a legacy version in the `[alerting]` section that takes precedence.
execute_alerts = true
# Alert evaluation timeout when fetching data from the datasource. This option has a legacy version in the `[alerting]` section that takes precedence.
# The timeout string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
evaluation_timeout = 30s
# Number of times we'll attempt to evaluate an alert rule before giving up on that evaluation. This option has a legacy version in the `[alerting]` section that takes precedence.
max_attempts = 3
# Minimum interval to enforce between rule evaluations. Rules will be adjusted if they are less than this value or if they are not multiple of the scheduler interval (10s). Higher values can help with resource management as we'll schedule fewer evaluations over time. This option has a legacy version in the `[alerting]` section that takes precedence.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
min_interval = 10s
[unified_alerting.screenshots]
# Enable screenshots in notifications. This option requires the Grafana Image Renderer plugin.
# For more information on configuration options, refer to [rendering].
capture = false
# The timeout for capturing screenshots. If a screenshot cannot be captured within the timeout then
# the notification is sent without a screenshot. The maximum duration is 30 seconds. This timeout
# should be less than the minimum Interval of all Evaluation Groups to avoid back pressure on alert
# rule evaluation.
capture_timeout = 10s
# The maximum number of screenshots that can be taken at the same time. This option is different from
# concurrent_render_request_limit as max_concurrent_screenshots sets the number of concurrent screenshots
# that can be taken at the same time for all firing alerts where as concurrent_render_request_limit sets
# the total number of concurrent screenshots across all Grafana services.
max_concurrent_screenshots = 5
# Uploads screenshots to the local Grafana server or remote storage such as Azure, S3 and GCS. Please
# see [external_image_storage] for further configuration options. If this option is false then
# screenshots will be persisted to disk for up to temp_data_lifetime.
upload_external_image_storage = false
[unified_alerting.reserved_labels]
# Comma-separated list of reserved labels added by the Grafana Alerting engine that should be disabled.
# For example: `disabled_labels=grafana_folder`
disabled_labels =
[unified_alerting.state_history]
# Enable the state history functionality in Unified Alerting. The previous states of alert rules will be visible in panels and in the UI.
enabled = true
#################################### Alerting ############################
[alerting]
# Enable the legacy alerting sub-system and interface. If Unified Alerting is already enabled and you try to go back to legacy alerting, all data that is part of Unified Alerting will be deleted. When this configuration section and flag are not defined, the state is defined at runtime. See the documentation for more details.
enabled =
# Makes it possible to turn off alert execution but alerting UI is visible
execute_alerts = true
# Default setting for new alert rules. Defaults to categorize error and timeouts as alerting. (alerting, keep_state)
error_or_timeout = alerting
# Default setting for how Grafana handles nodata or null values in alerting. (alerting, no_data, keep_state, ok)
nodata_or_nullvalues = no_data
# Alert notifications can include images, but rendering many images at the same time can overload the server
# This limit will protect the server from render overloading and make sure notifications are sent out quickly
concurrent_render_limit = 5
# Default setting for alert calculation timeout. Default value is 30
evaluation_timeout_seconds = 30
# Default setting for alert notification timeout. Default value is 30
notification_timeout_seconds = 30
# Default setting for max attempts to sending alert notifications. Default value is 3
max_attempts = 3
# Makes it possible to enforce a minimal interval between evaluations, to reduce load on the backend
min_interval_seconds = 1
# Configures for how long alert annotations are stored. Default is 0, which keeps them forever.
# This setting should be expressed as a duration. Ex 6h (hours), 10d (days), 2w (weeks), 1M (month).
max_annotation_age =
# Configures max number of alert annotations that Grafana stores. Default value is 0, which keeps all alert annotations.
max_annotations_to_keep =
#################################### Annotations #########################
[annotations]
# Configures the batch size for the annotation clean-up job. This setting is used for dashboard, API, and alert annotations.
cleanupjob_batchsize = 100
# Enforces the maximum allowed length of the tags for any newly introduced annotations. It can be between 500 and 4096 inclusive (which is the respective's column length). Default value is 500.
# Setting it to a higher value would impact performance therefore is not recommended.
tags_length = 500
[annotations.dashboard]
# Dashboard annotations means that annotations are associated with the dashboard they are created on.
# Configures how long dashboard annotations are stored. Default is 0, which keeps them forever.
# This setting should be expressed as a duration. Examples: 6h (hours), 10d (days), 2w (weeks), 1M (month).
max_age =
# Configures max number of dashboard annotations that Grafana stores. Default value is 0, which keeps all dashboard annotations.
max_annotations_to_keep =
[annotations.api]
# API annotations means that the annotations have been created using the API without any
# association with a dashboard.
# Configures how long Grafana stores API annotations. Default is 0, which keeps them forever.
# This setting should be expressed as a duration. Examples: 6h (hours), 10d (days), 2w (weeks), 1M (month).
max_age =
# Configures max number of API annotations that Grafana keeps. Default value is 0, which keeps all API annotations.
max_annotations_to_keep =
#################################### Explore #############################
[explore]
# Enable the Explore section
enabled = true
#################################### Help #############################
[help]
# Enable the Help section
enabled = true
#################################### Profile #############################
[profile]
# Enable the Profile section
enabled = true
#################################### Query History #############################
[query_history]
# Enable the Query history
enabled = true
#################################### Internal Grafana Metrics ############
# Metrics available at HTTP URL /metrics and /metrics/plugins/:pluginId
[metrics]
enabled              = true
interval_seconds     = 10
# Disable total stats (stat_totals_*) metrics to be generated
disable_total_stats = false
#If both are set, basic auth will be required for the metrics endpoints.
basic_auth_username =
basic_auth_password =
# Metrics environment info adds dimensions to the `grafana_environment_info` metric, which
# can expose more information about the Grafana instance.
[metrics.environment_info]
#exampleLabel1 = exampleValue1
#exampleLabel2 = exampleValue2
# Send internal Grafana metrics to graphite
[metrics.graphite]
# Enable by setting the address setting (ex localhost:2003)
address =
prefix = prod.grafana.%(instance_name)s.
#################################### Grafana.com integration  ##########################
[grafana_net]
url = https://grafana.com
[grafana_com]
url = https://grafana.com
api_url = https://grafana.com/api
#################################### Distributed tracing ############
# Opentracing is deprecated use opentelemetry instead
[tracing.jaeger]
# jaeger destination (ex localhost:6831)
address =
# tag that will always be included in when creating new spans. ex (tag1:value1,tag2:value2)
always_included_tag =
# Type specifies the type of the sampler: const, probabilistic, rateLimiting, or remote
sampler_type = const
# jaeger samplerconfig param
# for "const" sampler, 0 or 1 for always false/true respectively
# for "probabilistic" sampler, a probability between 0 and 1
# for "rateLimiting" sampler, the number of spans per second
# for "remote" sampler, param is the same as for "probabilistic"
# and indicates the initial sampling rate before the actual one
# is received from the mothership
sampler_param = 1
# sampling_server_url is the URL of a sampling manager providing a sampling strategy.
sampling_server_url =
# Whether or not to use Zipkin span propagation (x-b3- HTTP headers).
zipkin_propagation = false
# Setting this to true disables shared RPC spans.
# Not disabling is the most common setting when using Zipkin elsewhere in your infrastructure.
disable_shared_zipkin_spans = false
[tracing.opentelemetry]
# attributes that will always be included in when creating new spans. ex (key1:value1,key2:value2)
custom_attributes =
[tracing.opentelemetry.jaeger]
# jaeger destination (ex http://localhost:14268/api/traces)
address =
# Propagation specifies the text map propagation format: w3c, jaeger
propagation =
# This is a configuration for OTLP exporter with GRPC protocol
[tracing.opentelemetry.otlp]
# otlp destination (ex localhost:4317)
address =
# Propagation specifies the text map propagation format: w3c, jaeger
propagation =
#################################### External Image Storage ##############
[external_image_storage]
# Used for uploading images to public servers so they can be included in slack/email messages.
# You can choose between (s3, webdav, gcs, azure_blob, local)
provider =
[external_image_storage.s3]
endpoint =
path_style_access =
bucket_url =
bucket =
region =
path =
access_key =
secret_key =
[external_image_storage.webdav]
url =
username =
password =
public_url =
[external_image_storage.gcs]
key_file =
bucket =
path =
enable_signed_urls = false
signed_url_expiration =
[external_image_storage.azure_blob]
account_name =
account_key =
container_name =
sas_token_expiration_days =
[external_image_storage.local]
# does not require any configuration
[rendering]
# Options to configure a remote HTTP image rendering service, e.g. using https://github.com/grafana/grafana-image-renderer.
# URL to a remote HTTP image renderer service, e.g. http://localhost:8081/render, will enable Grafana to render panels and dashboards to PNG-images using HTTP requests to an external service.
server_url =
# If the remote HTTP image renderer service runs on a different server than the Grafana server you may have to configure this to a URL where Grafana is reachable, e.g. http://grafana.domain/.
callback_url =
# An auth token that will be sent to and verified by the renderer. The renderer will deny any request without an auth token matching the one configured on the renderer side.
renderer_token = -
# Concurrent render request limit affects when the /render HTTP endpoint is used. Rendering many images at the same time can overload the server,
# which this setting can help protect against by only allowing a certain amount of concurrent requests.
concurrent_render_request_limit = 30
# Determines the lifetime of the render key used by the image renderer to access and render Grafana.
# This setting should be expressed as a duration. Examples: 10s (seconds), 5m (minutes), 2h (hours).
# Default is 5m. This should be more than enough for most deployments.
# Change the value only if image rendering is failing and you see `Failed to get the render key from cache` in Grafana logs.
render_key_lifetime = 5m
[panels]
# kept to support old env variables; can be removed after a few months
enable_alpha = false
disable_sanitize_html = false
[plugins]
enable_alpha = false
app_tls_skip_verify_insecure = false
# Enter a comma-separated list of plugin identifiers to identify plugins to load even if they are unsigned. Plugins with modified signatures are never loaded.
allow_loading_unsigned_plugins =
# Enable or disable installing / uninstalling / updating plugins directly from within Grafana.
plugin_admin_enabled = true
plugin_admin_external_manage_enabled = false
plugin_catalog_url = https://grafana.com/grafana/plugins/
# Enter a comma-separated list of plugin identifiers to hide in the plugin catalog.
plugin_catalog_hidden_plugins =
#################################### Grafana Live ##########################################
[live]
# max_connections to Grafana Live WebSocket endpoint per Grafana server instance. See Grafana Live docs
# if you are planning to make it higher than default 100 since this can require some OS and infrastructure
# tuning. 0 disables Live, -1 means unlimited connections.
max_connections = 100
# allowed_origins is a comma-separated list of origins that can establish connection with Grafana Live.
# If not set then origin will be matched over root_url. Supports wildcard symbol "*".
allowed_origins =
# engine defines an HA (high availability) engine to use for Grafana Live. By default no engine used - in
# this case Live features work only on a single Grafana server.
# Available options: "redis".
# Setting ha_engine is an EXPERIMENTAL feature.
ha_engine =
# ha_engine_address sets a connection address for Live HA engine. Depending on engine type address format can differ.
# For now we only support Redis connection address in "host:port" format.
# This option is EXPERIMENTAL.
ha_engine_address = "127.0.0.1:6379"
#################################### Grafana Image Renderer Plugin ##########################
[plugin.grafana-image-renderer]
# Instruct headless browser instance to use a default timezone when not provided by Grafana, e.g. when rendering panel image of alert.
# See ICU’s metaZones.txt (https://cs.chromium.org/chromium/src/third_party/icu/source/data/misc/metaZones.txt) for a list of supported
# timezone IDs. Fallbacks to TZ environment variable if not set.
rendering_timezone =
# Instruct headless browser instance to use a default language when not provided by Grafana, e.g. when rendering panel image of alert.
# Please refer to the HTTP header Accept-Language to understand how to format this value, e.g. 'fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5'.
rendering_language =
# Instruct headless browser instance to use a default device scale factor when not provided by Grafana, e.g. when rendering panel image of alert.
# Default is 1. Using a higher value will produce more detailed images (higher DPI), but will require more disk space to store an image.
rendering_viewport_device_scale_factor =
# Instruct headless browser instance whether to ignore HTTPS errors during navigation. Per default HTTPS errors are not ignored. Due to
# the security risk it's not recommended to ignore HTTPS errors.
rendering_ignore_https_errors =
# Instruct headless browser instance whether to capture and log verbose information when rendering an image. Default is false and will
# only capture and log error messages. When enabled, debug messages are captured and logged as well.
# For the verbose information to be included in the Grafana server log you have to adjust the rendering log level to debug, configure
# [log].filters = rendering:debug.
rendering_verbose_logging =
# Instruct headless browser instance whether to output its debug and error messages into running process of remote rendering service.
# Default is false. This can be useful to enable (true) when troubleshooting.
rendering_dumpio =
# Additional arguments to pass to the headless browser instance. Default is --no-sandbox. The list of Chromium flags can be found
# here (https://peter.sh/experiments/chromium-command-line-switches/). Multiple arguments are separated by commas.
rendering_args =
# You can configure the plugin to use a different browser binary instead of the pre-packaged version of Chromium.
# Please note that this is not recommended, since you may encounter problems if the installed version of Chrome/Chromium is not
# compatible with the plugin.
rendering_chrome_bin =
# Instruct how headless browser instances are created. Default is 'default' and will create a new browser instance on each request.
# Mode 'clustered' will make sure that only a maximum of browsers/incognito pages can execute concurrently.
# Mode 'reusable' will have one browser instance and will create a new incognito page on each request.
rendering_mode =
# When rendering_mode = clustered, you can instruct how many browsers or incognito pages can execute concurrently. Default is 'browser'
# and will cluster using browser instances.
# Mode 'context' will cluster using incognito pages.
rendering_clustering_mode =
# When rendering_mode = clustered, you can define the maximum number of browser instances/incognito pages that can execute concurrently. Default is '5'.
rendering_clustering_max_concurrency =
# When rendering_mode = clustered, you can specify the duration a rendering request can take before it will time out. Default is `30` seconds.
rendering_clustering_timeout =
# Limit the maximum viewport width, height and device scale factor that can be requested.
rendering_viewport_max_width =
rendering_viewport_max_height =
rendering_viewport_max_device_scale_factor =
# Change the listening host and port of the gRPC server. Default host is 127.0.0.1 and default port is 0 and will automatically assign
# a port not in use.
grpc_host =
grpc_port =
[enterprise]
license_path =
[feature_toggles]
# there are currently two ways to enable feature toggles in the `grafana.ini`.
# you can either pass an array of feature you want to enable to the `enable` field or
# configure each toggle by setting the name of the toggle to true/false. Toggles set to true/false
# will take precedence over toggles in the `enable` list.
# enable = feature1,feature2
enable =
# Some features are enabled by default, see:
# https://grafana.com/docs/grafana/next/setup-grafana/configure-grafana/feature-toggles/
# To enable features by default, set `Expression:  "true"` in:
# https://github.com/grafana/grafana/blob/main/pkg/services/featuremgmt/registry.go
# feature1 = true
# feature2 = false
[date_formats]
# For information on what formatting patterns that are supported https://momentjs.com/docs/#/displaying/
# Default system date format used in time range picker and other places where full time is displayed
full_date = DD-MM-YYYY HH:mm:ss
# Used by graph and other places where we only show small intervals
interval_second = HH:mm:ss
interval_minute = HH:mm
interval_hour = DD/MM HH:mm
interval_day = DD/MM
interval_month = MM-YYYY
interval_year = YYYY
# Experimental feature
use_browser_locale = false
# Default timezone for user preferences. Options are 'browser' for the browser local timezone or a timezone name from IANA Time Zone database, e.g. 'UTC' or 'Europe/Amsterdam' etc.
default_timezone = Europe/Moscow
[expressions]
# Enable or disable the expressions functionality.
enabled = true
[geomap]
# Set the JSON configuration for the default basemap
default_baselayer_config =
# Enable or disable loading other base map layers
enable_custom_baselayers = true
#################################### Dashboard previews #####################################
[dashboard_previews.crawler]
# Number of dashboards rendered in parallel. Default is 6.
thread_count =
# Timeout passed down to the Image Renderer plugin. It is used in two separate places within a single rendering request:
# First during the initial navigation to the dashboard and then when waiting for all the panels to load. Default is 20s.
# This setting should be expressed as a duration. Examples: 10s (seconds), 1m (minutes).
rendering_timeout =
# Maximum duration of a single crawl. Default is 1h.
# This setting should be expressed as a duration. Examples: 10s (seconds), 1m (minutes).
max_crawl_duration =
# Minimum interval between two subsequent scheduler runs. Default is 12h.
# This setting should be expressed as a duration. Examples: 10s (seconds), 1m (minutes).
scheduler_interval =

#################################### Storage ################################################
[storage]
# Allow uploading SVG files without sanitization.
allow_unsanitized_svg_upload = false

#################################### Search ################################################
[search]
# Defines the number of dashboards loaded at once in a batch during a full reindex.
# This is a temporary setting that might be removed in the future.
dashboard_loading_batch_size = 200
# Defines the frequency of a full search reindex.
# This is a temporary setting that might be removed in the future.
full_reindex_interval = 5m
# Defines the frequency of partial index updates based on recent changes such as dashboard updates.
# This is a temporary setting that might be removed in the future.
index_update_interval = 10s

# Move an app plugin referenced by its id (including all its pages) to a specific navigation section
# Dependencies: needs the `topnav` feature to be enabled
# Format: <Plugin ID> = <Section ID> <Sort Weight>
[navigation.app_sections]
# Move a specific app plugin page (referenced by its `path` field) to a specific navigation section
# Format: <Page URL> = <Section ID> <Sort Weight>
[navigation.app_standalone_pages]

#################################### Secure Socks5 Datasource Proxy #####################################
[secure_socks_datasource_proxy]
enabled = false
root_ca_cert =
client_key =
client_cert =
server_name =
# The address of the socks5 proxy datasources should connect to
proxy_address =

grafana/provisioning/datasources/prometheus.yml

This file is needed so that Prometheus appears among the data sources as soon as Grafana is deployed.

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    jsonData:
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      prometheusVersion: 2.42.0

Prometheus:

prometheus/conf/prometheus.yml

The main Prometheus configuration file.

global:
  scrape_interval: 15s # Default scrape interval
# Scrape target configuration
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Windows exporter

Once everything was up and running, I breathed a sigh of relief: now I could move on to the main task, collecting metrics from the server. Without much deliberation I picked Windows exporter from the list of exporters in the official Prometheus documentation. Installing and configuring it was fairly simple: use the standard sc.exe utility to create a service that runs the exporter binary with the required parameters, then add the exporter's URL to the Prometheus configuration.
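
Before pointing Prometheus at a freshly installed exporter, it can help to confirm that the service actually answers on its metrics endpoint. Below is a minimal sketch of such a check, not part of the original project. It assumes the exporter listens on port 9182 (the port used in the Prometheus configuration in this article) and serves the standard Prometheus text format at /metrics. The function names are made up for illustration.

```python
from urllib.request import urlopen


def metric_names(exposition_text: str) -> set[str]:
    """Return the distinct metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value":
        # the name ends at the first "{" or at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metric_names(base_url: str, timeout: float = 5.0) -> set[str]:
    """Download /metrics from an exporter and list the metric names it exposes."""
    with urlopen(f"{base_url}/metrics", timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))
```

Calling fetch_metric_names("http://<server IP>:9182") from the monitoring host should return a non-empty set if the service is running and the port is reachable through the firewall.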

Changes to the project

Creating the windows exporter service:

sc.exe create windows_exporter type= own start= auto binpath= "C:\windows_exporter\windows_exporter-0.21.0-amd64.exe --config.file=C:\windows_exporter\config.yml" displayname= "Windows exporter (Prometheus)"

Prometheus:

prometheus/conf/prometheus.yml

global:
  scrape_interval: 15s # Default scrape interval

# Scrape target configuration
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape windows_exporter
  - job_name: "windows_exporter"
    static_configs:
      - targets: ["<server IP>:9182"]

After adding windows exporter I finally had some real data to work with, and I added several graphs. Most of the queries for them can be found in the exporter's own documentation.
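
To illustrate what such a query looks like outside Grafana, here is a hedged sketch that sends a PromQL expression to the Prometheus HTTP API (/api/v1/query) and reads the result. The CPU utilisation expression is an assumption built on the exporter's windows_cpu_time_total metric and is not necessarily the query used on the author's dashboard. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed example query: CPU utilisation in percent per instance,
# computed as 100 minus the share of time spent in "idle" mode.
CPU_QUERY = (
    '100 - avg by (instance) '
    '(rate(windows_cpu_time_total{mode="idle"}[5m])) * 100'
)


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_url}/api/v1/query?{urlencode({'query': promql})}"


def instant_values(payload: dict) -> dict[str, float]:
    """Map each instance label to its value in an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def run_query(prometheus_url: str, promql: str) -> dict[str, float]:
    """Execute an instant query and return {instance: value}."""
    with urlopen(build_query_url(prometheus_url, promql), timeout=10) as response:
        return instant_values(json.load(response))
```

For example, run_query("http://localhost:9090", CPU_QUERY) would return one value per scraped Windows host, assuming Prometheus is reachable at that address.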

Collecting APDEX measurements

The next step for me was sending metrics from 1C. Inspired by the article Метрики, графики, статистика = Prometheus + Grafana ("Metrics, graphs, statistics = Prometheus + Grafana"), I decided to collect metrics without creating an HTTP service. So I added one more service to the stack: pushgateway (its main purpose is described in that article). For 1C I wrote a very simple data processor that every 15 seconds ran a fairly simple query against the database, picked up the prepared data, and sent it in an HTTP request to pushgateway, which in turn exposed it to prometheus. The metrics were as follows:

Total execution time, broken down by key operation. (Counter)
Total number of key operations executed. (Counter)
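
The original sender is a 1C data processor whose code is not shown here. As a rough illustration of the same idea in Python, the sketch below renders the two counters in the Prometheus text format and pushes them to Pushgateway over HTTP. The metric names, the operation label, and the job name are all hypothetical; port 9091 is the one used for Pushgateway in the Prometheus configuration in this article.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def escape_label(value: str) -> str:
    """Escape a label value as required by the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counters(totals: dict[str, tuple[float, int]]) -> str:
    """Render {operation: (total_seconds, total_count)} as two counters.

    The metric names below are made up for this example.
    """
    duration_lines = ["# TYPE key_operation_duration_seconds_total counter"]
    count_lines = ["# TYPE key_operation_executions_total counter"]
    for operation, (seconds, count) in sorted(totals.items()):
        label = escape_label(operation)
        duration_lines.append(
            f'key_operation_duration_seconds_total{{operation="{label}"}} {seconds}'
        )
        count_lines.append(
            f'key_operation_executions_total{{operation="{label}"}} {count}'
        )
    # The body must end with a newline, otherwise Pushgateway rejects it.
    return "\n".join(duration_lines + count_lines) + "\n"


def push(gateway_url: str, job: str, body: str, timeout: float = 5.0) -> int:
    """PUT the rendered metrics into the group identified by the job name."""
    request = Request(
        f"{gateway_url}/metrics/job/{quote(job, safe='')}",
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status
```

A call such as push("http://localhost:9091", "apdex_1c", render_counters(totals)) every 15 seconds would mirror the schedule described above. PUT replaces all metrics previously pushed for that job, which suits a sender that always transmits the full current totals.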

With these two metrics I was able to get the average execution time of key operations and the number of operations executed per unit of time.
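
The arithmetic behind those two figures is simple: take two readings of each counter and divide the differences. In PromQL this is usually written as the ratio of two rate() expressions, for example rate(duration_total[5m]) / rate(executions_total[5m]), with the metric names here being placeholders. The sketch below shows the same calculation on raw counter readings; the function names are hypothetical.

```python
from typing import Optional


def average_duration(
    previous: tuple[float, int], current: tuple[float, int]
) -> Optional[float]:
    """Average seconds per operation between two counter readings.

    Each reading is (total_seconds, total_count). Returns None when no
    operations completed in the interval or a counter was reset.
    """
    delta_seconds = current[0] - previous[0]
    delta_count = current[1] - previous[1]
    if delta_count <= 0 or delta_seconds < 0:
        return None
    return delta_seconds / delta_count


def operations_per_second(
    previous_count: int, current_count: int, interval_seconds: float
) -> Optional[float]:
    """Throughput between two readings of the execution counter."""
    delta_count = current_count - previous_count
    if delta_count < 0 or interval_seconds <= 0:
        return None
    return delta_count / interval_seconds
```

For instance, if the duration counter grows from 100 to 130 seconds while the execution counter grows from 40 to 50, the average over that interval is 30 / 10 = 3 seconds per operation.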

Changes to the project

Prometheus MSSQL Exporter

Next, to complete the picture, I needed to collect metrics from MSSQL. Once again I looked in the Prometheus documentation and found Prometheus MSSQL Exporter in the list of exporters; it runs well in docker and is very simple to configure. After adding it, I extended the dashboard with a group of MSSQL graphs.

Changes to the project

Docker:

docker-compose.yml

version: '3.8'

services:
  # Grafana
  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    restart: unless-stopped
    user: root
    volumes:
      # Grafana configuration files
      - ./grafana/conf:/etc/grafana
      # Grafana data storage
      - ./grafana/grafana_data:/var/lib/grafana
    network_mode: "host"

  # Prometheus
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    restart: unless-stopped
    user: root
    volumes:
      # Prometheus configuration files
      - ./prometheus/conf:/etc/prometheus
      # Prometheus database files
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    network_mode: "host"

  # Pushgateway
  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: unless-stopped
    network_mode: "host"

  # Prometheus MSSQL Exporter
  mssqlexporter:
    image: awaragi/prometheus-mssql-exporter:v1.3.0
    container_name: mssqlexporter
    restart: unless-stopped
    network_mode: "host"
    environment:
      - SERVER=<server name/IP>
      - USERNAME=<user>
      - PASSWORD=<password>
      - EXPOSE=4000

Prometheus:

prometheus/conf/prometheus.yml

global:
  scrape_interval: 15s # Default scrape interval

# Scrape target configuration
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape windows_exporter
  - job_name: "windows_exporter"
    static_configs:
      - targets: ["<server IP>:9182"]

  # Scrape Pushgateway
  - job_name: "pushgateway"
    honor_labels: true
    static_configs:
      - targets: ["localhost:9091"]

  # Scrape Prometheus MSSQL Exporter
  - job_name: "mssql_exporter"
    static_configs:
      - targets: ["localhost:4000"]

First results

Once the first version of the dashboard was ready, with initial graphs for the Windows server, MSSQL, and 1C, I showed it to my manager. We watched it for about a week, and the only conclusion we reached was that we had no serious hardware problems.

A new task

A week later I was given a new task describing several graphs to add to the dashboard. Some I added right away because the data for them already existed, but a surprise was waiting at the end of the list. I saw the fateful word "locks", and that could mean only one thing: I had to collect data from the technological log.

Grafana Loki

There was even less information online about collecting and aggregating the technological log; I would say there is practically none. The only mentions I found were about aggregating logs with ELK (Elasticsearch, Logstash, and Kibana). I spent 3-4 working days trying to get that machine running at all and kept hitting new obstacles. In the end it became clear that this solution was not for us: it is too resource-hungry, it demands a lot of background knowledge, and it was unclear how we would later explain to people what to do with it.

I no longer remember how it happened, but I came across Grafana Loki. The service described itself as a log aggregator inspired by Prometheus that is:

Easy to scale
Cost-effective in storage
Simple to operate

These sweet words impressed me, all the more so because so many familiar words appeared in the name and the description. I decided to go in this direction.

Loki consists of two parts:

Loki - the service that acts as a kind of database for storing logs and lets you query them with its built-in language, LogQL.
Promtail - the agent shipped with Loki that collects log data and sends it to Loki. Other software can act as the agent too, including tools from the ELK stack, but I decided to use the bundled one.

Adding Loki to the project was not difficult: I took the sample settings from the documentation and added them to the project.

Project changes

Project structure:
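To make the agent's role more concrete, here is a small Python sketch of the JSON body that a client posts to Loki's HTTP push endpoint (`/loki/api/v1/push`). It is a simplified illustration, not how Promtail is implemented; the function name and the sample labels are assumptions.

```python
import json
import time


def build_push_payload(labels: dict, lines: list, timestamp_ns=None) -> str:
    """Build a JSON body for Loki's /loki/api/v1/push endpoint.

    labels: stream labels, for example {"job": "ones-tj"}.
    lines:  log lines that belong to this stream.
    Each value is a pair [timestamp in nanoseconds as a string, log line].
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    # Add one nanosecond per line so that the entries keep their order.
    values = [[str(timestamp_ns + i), line] for i, line in enumerate(lines)]
    payload = {"streams": [{"stream": labels, "values": values}]}
    return json.dumps(payload, ensure_ascii=False)


print(build_push_payload({"job": "demo"}, ["first line", "second line"], 1000))
```

The body is sent with the `Content-Type: application/json` header. Everything that identifies a stream lives in the labels, which is why the choice of labels matters so much when configuring the agent.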

Docker:

docker-compose.yml

version: '3.8'

services:
  # Grafana
  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    restart: unless-stopped
    user: root
    volumes:
      # Grafana configuration files
      - ./grafana/conf:/etc/grafana
      # Grafana data storage
      - ./grafana/grafana_data:/var/lib/grafana
    network_mode: "host"

  # Prometheus
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    restart: unless-stopped
    user: root
    volumes:
      # Prometheus configuration files
      - ./prometheus/conf:/etc/prometheus
      # Prometheus database files
      - ./prometheus/prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    network_mode: "host"

  # Pushgateway
  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: unless-stopped
    network_mode: "host"

  # Prometheus MSSQL Exporter
  mssqlexporter:
    image: awaragi/prometheus-mssql-exporter:v1.3.0
    container_name: mssqlexporter
    restart: unless-stopped
    network_mode: "host"
    environment:
      - SERVER=<server name/IP>
      - USERNAME=<user>
      - PASSWORD=<password>
      - EXPOSE=4000

  # Loki
  loki:
    image: grafana/loki:2.7.5
    container_name: loki
    restart: unless-stopped
    user: root
    volumes:
      # Loki configuration files
      - ./loki/conf:/etc/loki
      # Loki database files
      - ./loki/loki_data:/var/loki
    network_mode: "host"

Grafana:

grafana/provisioning/datasources/loki.yml

This file is needed so that Loki appears among the data sources as soon as Grafana is deployed.

apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    jsonData:
      maxLines: 1000

Prometheus:

prometheus/conf/prometheus.yml

global:
  scrape_interval: 15s # Default scrape interval

# Scrape target configuration
scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape windows_exporter
  - job_name: "windows_exporter"
    static_configs:
      - targets: ["<server IP>:9182"]

  # Scrape Pushgateway
  - job_name: "pushgateway"
    honor_labels: true
    static_configs:
      - targets: ["localhost:9091"]

  # Scrape Prometheus MSSQL Exporter
  - job_name: "mssql_exporter"
    static_configs:
      - targets: ["localhost:4000"]

  # Scrape Loki
  - job_name: "loki"
    static_configs:
      - targets: ["localhost:3100"]

  # Scrape Promtail
  - job_name: "promtail"
    static_configs:
      - targets: ["<server IP>:9080"]

Loki:

loki/conf/local-config.yaml

The Loki configuration file. Many of the limit settings were arrived at by trial and error.

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 99194304
  grpc_server_max_send_msg_size: 99194304

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/loki
  storage:
    filesystem:
      chunks_directory: /var/loki/chunks
      rules_directory: /var/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 1000

schema_config:
  configs:
    - from: 2020-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

compactor:
  retention_enabled: true

ingester:
  max_chunk_age: 12h

limits_config:
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 20
  retention_period: 15d
  per_stream_rate_limit: 100MB

Not without problems

The main problems were waiting for me when I configured Promtail.

The first thing I ran into was how the technological log was stored. Ours was configured in what I assume is a more or less classic way: many folders with separate filters, so that it could be read at all when hunting for problems. On top of that, everything was kept for 48 hours, and anyone familiar with the tech log knows that this means a huge pile of files taking up many gigabytes. If you point Promtail at a structure like that, the thousands of files make it eat all the memory and CPU on the server, and eventually things end badly.

I solved this problem quite simply, by adding one more setting to logcfg.xml:

C:\Program Files\1cv8\conf\logcfg.xml

The file below contains only the new setting.

<?xml version="1.0"?>
<config xmlns="http://v8.1c.ru/v8/tech-log" xmlns:glvsvc="http://www.gilev.ru/service">
	<log location="H:\TECH_LOGS\_FOR_LOKI" history="1">
		<event>
			<eq property="name" value="CALL"/>
			<eq property="p:processName" value="trade11_2"/>
			<gt property="duration" value="1000"/>
		</event>
		<event>
			<eq property="name" value="SCALL"/>
			<eq property="p:processName" value="trade11_2"/>
			<gt property="duration" value="1000"/>
		</event>
		<event>
			<eq property="name" value="MEM"/>
			<eq property="p:processName" value="trade11_2"/>
		</event>
		<event>
			<eq property="name" value="EXCP"/>
			<eq property="p:processName" value="trade11_2"/>
		</event>
		<event>
			<eq property="name" value="TLOCK"/>
			<eq property="p:processName" value="trade11_2"/>
		</event>
		<event>
			<eq property="name" value="TTIMEOUT"/>
			<eq property="p:processName" value="trade11_2"/>
		</event>
		<event>
			<eq property="name" value="TDEADLOCK"/>
			<eq property="p:processName" value="trade11_2"/>
		</event>
		<event>
			<eq property="name" value="DBMSSQL"/>
			<eq property="p:processName" value="trade11_2"/>
			<gt property="duration" value="1000"/>
		</event>
		<property name="all"/>
	</log>
</config>

This setting sharply reduced the number of files in the folder; as a result we kept only the current and the previous hour of logs. With this configuration Promtail used 200-300 MB of RAM and 0.5% CPU under load, which suited me fine.

The next unpleasant problem was that a service created with sc.exe refused to start. This is probably because the program was originally written for Unix systems and Windows was not a priority for the developers. A small utility I found online, WinSW, solved the problem. In essence it is a wrapper around any executable that lets you register it as a Windows service.

One more nuance of the storage structure is worth noting. Inside the root tech log folder, 1C creates subfolders named after the service and the process ID. 1C also likes to start and kill its processes as needed. If 1C tries to delete the folder of a finished process while Promtail is reading logs from it, it cannot do so until Promtail is restarted. So you need to add a Task Scheduler job that restarts the Promtail service once a day so that these folders get cleaned up.

The remaining problems came from configuring Promtail itself, because 1C writes logs in a very non-standard way. In particular:

Part of the record's timestamp is in the file name and part is in the log line.
A single event can span several lines.
Within a log line, some fields have the form key=value, while others are bare values.

config.yaml

The Promtail configuration file.
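These peculiarities are easier to see on a concrete record. The Python sketch below combines the hour from a file name of the form YYMMDDHH.log with the minutes and seconds from the record itself, and separates the bare values from the key=value pairs. The sample record and the function name are invented for illustration, the +03:00 offset is an assumption about the server's time zone, and quoted values that contain commas are deliberately not handled.

```python
import re
from datetime import datetime, timedelta, timezone

# MM:SS.ffffff-duration,EVENT,level,key=value,...
EVENT_RE = re.compile(
    r"^(?P<minute>\d{2}):(?P<second>\d{2})\.(?P<fraction>\d+)-(?P<duration>\d+),"
    r"(?P<event>[^,]+),(?P<level>\d+),?(?P<rest>.*)$",
    re.DOTALL,
)
# The file name carries year, month, day and hour: YYMMDDHH.log
FILE_RE = re.compile(r"(?P<yy>\d{2})(?P<mm>\d{2})(?P<dd>\d{2})(?P<hh>\d{2})\.log$")


def parse_event(file_name: str, record: str, utc_offset_hours: int = 3):
    """Parse one tech log record into a dict, or return None if it does not match."""
    file_match = FILE_RE.search(file_name)
    line_match = EVENT_RE.match(record)
    if file_match is None or line_match is None:
        return None
    # Normalize the fractional part to microseconds.
    fraction = line_match["fraction"][:6].ljust(6, "0")
    timestamp = datetime(
        2000 + int(file_match["yy"]), int(file_match["mm"]),
        int(file_match["dd"]), int(file_match["hh"]),
        int(line_match["minute"]), int(line_match["second"]), int(fraction),
        tzinfo=timezone(timedelta(hours=utc_offset_hours)),
    )
    # Simplification: values are assumed not to contain commas.
    properties = dict(re.findall(r"(?:^|,)([\w:]+)=([^,]*)", line_match["rest"]))
    return {
        "timestamp": timestamp,
        "duration": int(line_match["duration"]),
        "event": line_match["event"],
        "properties": properties,
    }


sample = "45:12.345000-1200345,TLOCK,3,process=rphost,Usr=Ivanov,WaitConnections=15"
print(parse_event("23102607.log", sample))
```

The Promtail configuration performs the same steps with its regex, template, and timestamp stages.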

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: C:\_APP\promtail\positions.yaml

clients:
  - url: http://<IP address of the Loki server>:3100/loki/api/v1/push

scrape_configs:
  - job_name: ones-tj
    pipeline_stages:
      - multiline:
          firstline: '^.*[\d]{2}:[\d]{2}.[\d]+-[\d]+,'
          max_wait_time: 3s
          max_lines: 5000
      - regex:
          expression: "^.*(?P<time>[\\d]{2}:[\\d]{2}.[\\d]+)-(?P<duration>[\\d]+?),(?P<event>\\S+?),.*process=(?P<process>\\S+?),"
      - regex:
          expression: "^.*p:processName=(?P<processName>\\S+?),"
      - regex:
          expression: "^.*Usr=(?P<user>.*?),"
      - template:
          source: timestamp
          template: '{{ $val := .filename | regexFind "[1-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9]" }}20{{ $val | substr 0 2 }}-{{ $val | substr 2 4 }}-{{ $val | substr 4 6 }}T{{ $val | substr 6 8 }}:{{ .time }}+03:00'
      - labels:
          event:
          user:
      - timestamp:
          format: RFC3339Nano
          source: timestamp
      - labeldrop:
          - filename
      - metrics:
          event_duration:
            type: Counter
            description: "Event duration from ones tj"
            source: duration
            prefix: promtail_onestj_
            max_idle_duration: 24h
            config:
              action: add

    static_configs:
      - targets:
          - localhost
        labels:
          job: ones-tj
          __path__: H:\TECH_LOGS\_FOR_LOKI\**\*.log
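The idea behind the multiline stage can be sketched in a few lines of Python: every line that looks like the start of an event opens a new record, and all other lines are appended to the current one. This is only an illustration of the grouping logic under simplified assumptions; the real stage also flushes a record after max_wait_time, which a batch function cannot model, and the regular expression here is a slightly stricter variant of the firstline setting.

```python
import re

# Start of an event: MM:SS.ffffff-duration,
FIRST_LINE = re.compile(r"^\d{2}:\d{2}\.\d+-\d+,")


def group_records(lines, max_lines=5000):
    """Merge physical lines into logical records, one per tech log event."""
    records = []
    current = []
    for line in lines:
        starts_new = FIRST_LINE.match(line) is not None
        # Flush when a new event begins or the record has grown too long.
        if current and (starts_new or len(current) >= max_lines):
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records


raw = [
    "45:12.345000-1200345,DBMSSQL,5,Sql='SELECT",
    "FROM t'",
    "45:13.000001-5,EXCP,0,Descr=x",
]
for record in group_records(raw):
    print(repr(record))
```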

The magic of Loki

To demonstrate the power of Loki, here is a query that returns lock acquisition events (TLOCK) that lasted longer than 1 second and have a non-empty WaitConnections field:

{event="TLOCK"} | pattern `<_>-<duration>,<_>WaitConnections=<wait_conn>,` | duration > 1000000 | wait_conn =~`\S+`

Let me break the expression down to make it clearer. A Loki query consists of stages separated by a vertical bar. Our query consists of four stages:

{event="TLOCK"} - the log stream selector, which says that we select all records with the event "TLOCK". It is the mandatory first part of a query. The style is inspired by Prometheus.
pattern `<_>-<duration>,<_>WaitConnections=<wait_conn>,` - this stage matches the log record against the given pattern and extracts additional fields for further processing. In our case we extract the duration and wait_conn fields.
duration > 1000000 - this stage is fairly self-explanatory: we keep all records whose duration exceeds 1,000,000 microseconds, that is, more than 1 second.
wait_conn =~`\S+` - this stage requires the wait_conn field to match the regular expression \S+, which means that wait_conn must be filled in.

Summary

In the end I got a working log collection tool in which it is easy to search for the events you need and to run investigations on tech log data. We have already carried out several investigations and fixed some deadlock problems.

The overall monitoring setup looks like this:

P.S.

Thank you to the reader who made it through this text. I hope it was interesting and will be of practical use.

I deliberately did not focus on the technical details in this article, because with them it would turn into a book.

I have attached an archive with the Git project (without the .git folder) containing all the configuration files, along with a filled-in README.md and .gitignore.

monitoring grafana prometheus pushgateway windows_exporter prometheus-mssql-exporter loki promtail winsw
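For readers who find regular expressions more familiar than LogQL, the same four stages can be emulated in Python. This is a rough sketch of the semantics, not of how Loki executes queries; the sample records are invented, and the pattern stage is approximated with non-greedy regex groups.

```python
import re

# Approximation of: pattern `<_>-<duration>,<_>WaitConnections=<wait_conn>,`
PATTERN = re.compile(
    r"^.*?-(?P<duration>.*?),.*?WaitConnections=(?P<wait_conn>.*?),",
    re.DOTALL,
)


def matches(labels: dict, line: str) -> bool:
    """Return True if a record passes all four stages of the query."""
    # Stage 1: stream selector {event="TLOCK"}
    if labels.get("event") != "TLOCK":
        return False
    # Stage 2: extract duration and wait_conn from the line
    found = PATTERN.match(line)
    if found is None:
        return False
    # Stage 3: duration > 1000000 (microseconds)
    try:
        duration = float(found["duration"])
    except ValueError:
        return False
    if duration <= 1_000_000:
        return False
    # Stage 4: wait_conn =~ `\S+` (the whole value must match)
    return re.fullmatch(r"\S+", found["wait_conn"]) is not None


records = [
    "45:12.345000-1200345,TLOCK,3,process=rphost,WaitConnections=15,Context=x",
    "45:13.000000-2000000,TLOCK,3,process=rphost,WaitConnections=,Context=x",
    "45:14.000000-500,TLOCK,3,process=rphost,WaitConnections=7,Context=x",
]
for record in records:
    print(matches({"event": "TLOCK"}, record), record)
```

Only the first record passes: the second has an empty WaitConnections value, and the third is too short.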


Comments

1. dsdred 3330 10.05.23 08:52

I also set up Prometheus + Grafana monitoring not long ago.
Great article.

2. Ifoxy 16 10.05.23 10:09

Great article. Very useful!

3. starik-2005 3037 10.05.23 10:31

4. G_117284249425563207239 10.05.23 11:57

Cool article!

5. frkbvfnjh 787 10.05.23 12:11

Finally someone has written about Grafana; the question is whether I'll understand what I read... But the article is needed!

6. sevushka 303 10.05.23 13:03

Why make everything so complicated?
Monitoring: Zabbix, where practically everything works out of the box, both hardware and MS SQL.
Apdex in 1C, locks (latch, lock) and so on: there are V. Gilev's services, which have been working for many years, and even the free version is quite decent. They take a few hours to set up (the first time), and they are written in 1C and for 1C. The downside: if you need analytics over a long period, that costs money, but for many people the 3 days of the free version are enough.
What problems were solved after rolling out "all of this"? You confirmed that the hardware is not overloaded (and it is far from certain that you chose all the right parameters) and found a few deadlocks?
I am by no means criticizing the article itself; it's great that the head of the department gave a 1C developer a break and let him do publicly useful admin work for the department instead of customer tasks.

7. andreysidor4uk 194 10.05.23 14:06

(6) The first reason: in my opinion this is not overcomplication, just a configured set of tools.
Zabbix is also a tool that has to be configured. It's just that there are already plenty of articles on setting it up, many people have done it, and so it doesn't look complicated.
The second reason is the task as assigned. I was pointed to an example of what my manager wanted to see, and what he showed was Grafana, not Zabbix, which, incidentally, we also have. Why we have two monitoring systems is an internal story, not for publication.
The third reason is simply taste. To me Grafana looks much better than Zabbix.
The fourth reason: all the tools are gathered in one place; I don't need to open 1C or look at a pile of completely different interfaces. Everything looks the same and lives in one place. It works crisply and fast, and that is very appealing.
The fifth reason: it is completely free and it scales easily (it's far from certain we'll need that, but the option is there).

As for tools written in 1C: yes, you can use them too, and I'm sure they do their job. But first, as you noted yourself, they are only partly free. Second, in my opinion 1C was created for other purposes, and parsing tech log files and collecting system data is not its calling. And third, speed. Maybe I'm just unlucky, but all the similar data processors solving such problems ran rather slowly for me. I'm sure there are fast ones; I just haven't come across them yet.

As for problems solved, it's too early to judge: the tool is new and we are still changing it, adding some graphs and removing others.
Right now the most useful tool is Loki. Tech log analysis is a dream, everything is quick and easy. As the article says, we have already solved a couple of problems, and now we meet once a week to analyze the tech log and look for other errors.

9. siamagic 10.05.23 14:33

(7) Of everything described, only the tech log is useful. A bells-and-whistles toy like Grafana gives zero information; besides, everything is already available in the standard Windows monitoring. Analyzing the 1C cluster would also be more useful. But you are only at the beginning of the road, so you don't understand that yet.

19. sevushka 303 11.05.23 05:47

(7) Thanks, a well-argued answer.

8. SGordon1 10.05.23 14:23

The part at the beginning of the article about the configuration not being updated confuses the reader... And did the happy ending come, did you find what you were tripping over?

10. andreysidor4uk 194 10.05.23 14:33

(8) It's the vendor configuration that isn't updated, of course. The configuration as a whole is being extended non-stop. =)
We haven't found it yet, but the tool is new and we're working on it)

11. kser87 2441 10.05.23 15:32

A superb article. Practically a ready-made guide. Thanks to the author.

12. Dach 373 10.05.23 18:08

(0) I don't know which IDE you use, but install the Spell Checker plugin with a Russian dictionary (or an equivalent): there are many spelling mistakes in the README and other files.

The README also lacks a reminder that all the config files should first be adjusted to your own environment.

A single Docker service is definitely a plus. I haven't tested it yet, but I think it should all work, since it worked before.

13. andreysidor4uk 194 10.05.23 18:31

(12) Thanks for the remark, I'll keep it in mind. =)
The Git project was taken from an existing repository on our internal GitLab. I removed our logins, passwords, IP addresses and so on, and forgot to mention that in the README...

14. Dach 373 10.05.23 18:34

(13) Do you track the load created by the containers somehow? With Portainer, for example? It would be good if you showed screenshots of the container load.

Or at least a screenshot of htop from the Docker host, to get a rough idea of the resource usage.

15. andreysidor4uk 194 10.05.23 18:49

(14) I haven't watched the load closely, because when I log in to the server, htop shows practically none.

Attached files:

16. Dach 373 10.05.23 19:12

(15) And what is the volume of tech log data per hour with the settings you made? At peak load, in the middle of the day, for example.

"If 1C tries to delete the folder of a finished process while Promtail is reading logs from it, it cannot do so until Promtail is restarted. So you need to add a Task Scheduler job that restarts the Promtail service once a day so that these folders get cleaned up."

And this part is not quite clear:
- first, does it "hold" all the folders it has ever read from (even if it isn't reading from them right now)?
- second, how does restarting its service help? Do the folders get deleted by themselves?

17. andreysidor4uk 194 10.05.23 19:52

(16) I can't say how much it comes to per hour; I don't have that information. With the current settings all data, both in Prometheus and in Loki, is kept for 15 days. I've attached a file showing the disk space currently used.

first, does it "hold" all the folders it has ever read from (even if it isn't reading from them right now)?

Yes, exactly, it prevents these folders from being deleted. An attempt to open or delete such a folder ends with an access error; I'll attach a screenshot of that too.

second, how does restarting its service help? Do the folders get deleted by themselves?

Yes, that's right. As soon as the Promtail service is stopped, the folders disappear immediately. I think it's worth filing an issue on GitHub about this, but I haven't gotten around to it yet.

Attached files:

18. PerlAmutor 129 11.05.23 05:20

Has anyone run into the problem where, with too frequent refreshes, the browser showing the Grafana dashboards fills up all the memory and simply stops displaying anything, hanging with a "page is not responding" dialog? We played with Grafana for a while, and looking at graphs that are NOT in real time turned out to be uninteresting. I went back to ordinary tech log analysis.

20. andreysidor4uk 194 11.05.23 07:07

(18) I haven't seen this problem so far; perhaps it has been fixed.
For the tech log you can keep only Grafana, Loki, and Promtail from the stack and enjoy convenient analysis. Analyzing the log through Loki is a whole different level of experience, I recommend it)))

21. bulpi 215 11.05.23 08:13

"reducing the time it takes to post a customer order.

A lot of work was done over the first three months"

You live in some other universe. In 3 months I get to grips with a task from scratch, write the configuration, roll it out, train the staff, and provide support. And no monitoring is needed, because everything runs fast.
Where did I take a wrong turn?

23. andreysidor4uk 194 11.05.23 08:31

(21) Congratulations, I hope we reach your level someday!

22. JohnyDeath 301 11.05.23 08:19

Great article.
I've set up a similar beauty for myself. I haven't done APDEX and the tech log yet, but that's in my immediate plans.
But I do have monitoring of the event log: the dashboard shows the total number of event log errors in a stat panel, with a table of the errors themselves below it. The event log itself is exported to ClickHouse by an excellent utility, https://github.com/akpaevj/OneSTools.EventLog.
I've also hooked up monitoring of used licenses and other internal 1C server metrics using https://github.com/JohnyDeath/prometheus_1C_exporter

Attached files:

24. andreysidor4uk 194 11.05.23 08:34

(22) Thank you.
And thanks for the great links, I hadn't come across these tools. I'll study them and add them to my toolkit; I think they'll come in handy)

32. triviumfan 93 12.05.23 16:00

(22) Beautiful.

25. JohnyDeath 301 11.05.23 09:28

Could you show the data processor that sends the APDEX data?

For 1C I wrote a very simple data processor that ran a lightweight query against the database every 15 seconds, picked up the prepared data, and sent it to Pushgateway in an HTTP request

26. andreysidor4uk 194 11.05.23 10:00

(25) I deliberately didn't publish the data processor because it was written in a hurry and isn't really presentable yet, but it looks like I'll have to)))

Attached files:

PrometheusExporter.epf

41. JohnyDeath 301 07.07.23 11:37

(26) Could you tell me why, in the code, you add the operation count and the duration to the previously sent value?

Значение = ТекущиеЗначения.Получить(Стр.КлючеваяОперация + "count");
Если Значение = Неопределено Тогда
	Значение = 0;
КонецЕсли;
Значение = Значение + Стр.КоличествоВыполнений;

ЗаписатьМетрику(
	Запись,
	ИмяМетрики,
	Значение,
	Новый Структура("operation", Стр.КлючеваяОперация)
);

That is, suppose the measurements contain two records for the same key operation, each with a duration of, say, 10 seconds.
We send to Prometheus: Count = 2, Duration = 20.
Then 3 more records appear, 15 seconds each.
This time what gets sent is not (Count = 3, Duration = 45) but (Count = 5, Duration = 65).
Why is it done this way?

42. andreysidor4uk 194 08.07.23 10:51

(41) Prometheus has several metric types. I ruled out Histogram and Summary because of their complexity and the lack of time to figure them out. That left Gauge, which suits values that can go both up and down. But it doesn't fit my case, because I wanted to see two metrics: the average execution time of operations and the number of operations per unit of time, and to compute those you need metrics of type Counter. A Counter is a monotonically increasing counter, which is what I implemented.
The documentation on metric types may give you a bit more information.
https://prometheus.io/docs/concepts/metric_types/

28. user1946710 11.05.23 16:44

I too "played" with plugins and other wrappers for Grafana at one point, and in the end found my ideal in the telegraf -> influxdb -> grafana chain. On my dashboard I display a mix of API data pulled from various servers, databases, and software, and there is even data parsed from HTML pages. In short, anything at all, with no limitations and no ties to Zabbix, PRTG, and the like.

29. JohnyDeath 301 11.05.23 21:34

(28)
In this chain

telegraf -> influxdb -> grafana

is telegraf https://github.com/telegraf/telegraf ? So all the information is first passed from some system/database to a Telegram bot, which forwards it to influxdb, to which Grafana is then attached? Is that right?

anything at all, with no limitations and no ties to Zabbix, PRTG, and the like

And how does influxdb differ here from Zabbix, PRTG, and "the like"? It is just another DBMS geared to storing time series. What are the advantages?

30. andreysidor4uk 194 12.05.23 06:38

(29) I can answer about Telegraf. You found not quite the right one; here is the one you need: https://www.influxdata.com/time-series-platform/telegraf/.

31. user1946710 12.05.23 13:10

(29)
Telegraf and InfluxDB were chosen as a pair that lets you perform any action and write its result to the database with minimal effort. For example, suppose there is an HTTP REST API URL that returns certain data in JSON format; this Telegraf config section requests the data, parses it, and writes it to the database.

[[inputs.http]]
insecure_skip_verify=true
urls = ["https://_____/api/table.json?username=&passhash=&content=sensors&id=6338&columns=objid,type,name,device,host,status,group"]
interval = "16s"
method = "GET"
data_format = "json"
json_query = "sensors"
tag_keys = ["objid"]
json_string_fields = ["type","name","device","host","status_raw","group"]
name_override = "http"

And this config section runs a script that returns data in CSV format and stores it in the database:

[[inputs.exec]]
# graylog search message cp fan

commands = ["/Python39/Script/greylog/greylog_CP_fan.sh"]
interval = "300s"
timeout = "25s"
data_format = "csv"
csv_header_row_count = 0
csv_column_names = ["Mytime","IP"]
csv_skip_rows = 0
csv_skip_columns = 0
csv_delimiter = "&"
csv_comment = ""
csv_trim_space = true
csv_tag_columns = ["Mytime"]
name_override = "exec_grlcpfan"
[inputs.exec.tags]
influxdb_database = "exec_grlcpfan"

Then, on the Grafana side, the standard InfluxDB plugin bundled with Grafana and ordinary queries pull the required data out of the database:
SELECT IP FROM "100h"."exec_grlcpfan" WHERE IP =~ // AND time >= now()-3000s

And we draw this data any way we like; an example dashboard is in the screenshot.

On the whole, my approach may seem complicated at first glance, but once you figure it out it lets you display anything in Grafana without extra stacks and layers. If you wish, you can put the scripts into cron and store the data in PostgreSQL, or use any other variation that gives the same result, but in my opinion that would be needlessly complicated.

Attached files:

33. JohnyDeath 301 12.05.23 16:05

(31) How is this "without extra stacks" if you are proposing your own stack of at least two new technologies?

34. user1946710 12.05.23 19:57

Without extra ones in the sense that this stack is universal. I collect data from a dozen different servers with different services, query various databases directly, and even have scripts that open the required pages and parse the needed data from HTML. My approach works with anything, whereas many other solutions are limited by their purpose and feature set.

35. ropots 19.05.23 10:31

Thanks for the article, I got very interested.
Everything worked except the MS SQL exporter. Why port 4000?
And which IP should be in docker-compose.yml?
In Prometheus Targets the mssql_exporter endpoint shows an error. Should port 1443 be used?
In short, I can't figure out which service pulls the metrics from MS SQL.

36. andreysidor4uk 194 19.05.23 18:03

(35) 4000 is the port on which the metrics are served; you can see them at http://<monitoring server IP>:4000/metrics
In docker-compose.yml, the server address must be the IP address of the MSSQL server.
Prometheus may complain in Targets because the mssql_exporter service does not start. It is probably failing to start precisely because of incorrect parameters.
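
One way to check whether the exporter is actually up, independently of Prometheus, is to request the metrics page yourself. This is a minimal sketch using only the standard library; the function names are made up for illustration, and the parser is deliberately simplistic (it assumes each sample line ends with the value and carries no trailing timestamp).

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {sample_name: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser does not understand
    return samples


def fetch_metrics(url, timeout=5):
    """Download and parse a metrics page; raises URLError if nothing answers."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics("http://<monitoring server IP>:4000/metrics")` with the real address substituted should return a non-empty dictionary if the container started; a connection error points at the exporter itself rather than at Prometheus.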

47. ropots 26.10.23 07:23

(36)
Good afternoon, could you show the contents of the file

filename: C:\_APP\promtail\positions.yaml

48. andreysidor4uk 194 26.10.23 08:20

(47) Good afternoon, yes, of course. At the moment the file looks like this:

positions:
  H:\TECH_LOGS\_FOR_LOKI\crserver_4836\23102607.log: "798"
  H:\TECH_LOGS\_FOR_LOKI\crserver_4836\23102608.log: "0"
  H:\TECH_LOGS\_FOR_LOKI\httpd_35780\23102607.log: "103521"
  H:\TECH_LOGS\_FOR_LOKI\httpd_35780\23102608.log: "32790"
  H:\TECH_LOGS\_FOR_LOKI\mmc_34752\23102607.log: "0"
  H:\TECH_LOGS\_FOR_LOKI\mmc_34752\23102608.log: "0"
  H:\TECH_LOGS\_FOR_LOKI\ragent_7436\23102607.log: "1388875"
  H:\TECH_LOGS\_FOR_LOKI\ragent_7436\23102608.log: "506634"
  H:\TECH_LOGS\_FOR_LOKI\rmngr_42452\23102607.log: "611723730"
  H:\TECH_LOGS\_FOR_LOKI\rmngr_42452\23102608.log: "214893943"
  H:\TECH_LOGS\_FOR_LOKI\rphost_2736\23102607.log: "271770"
  H:\TECH_LOGS\_FOR_LOKI\rphost_3808\23102607.log: "4061128"
  H:\TECH_LOGS\_FOR_LOKI\rphost_11252\23102607.log: "25088"
  H:\TECH_LOGS\_FOR_LOKI\rphost_14844\23102607.log: "3531592"
  H:\TECH_LOGS\_FOR_LOKI\rphost_25616\23102607.log: "654632153"
  H:\TECH_LOGS\_FOR_LOKI\rphost_25616\23102608.log: "144402541"
  H:\TECH_LOGS\_FOR_LOKI\rphost_26000\23102607.log: "51034"
  H:\TECH_LOGS\_FOR_LOKI\rphost_33548\23102607.log: "635157063"
  H:\TECH_LOGS\_FOR_LOKI\rphost_33548\23102608.log: "106582438"
  H:\TECH_LOGS\_FOR_LOKI\rphost_34008\23102607.log: "23810"
  H:\TECH_LOGS\_FOR_LOKI\rphost_38388\23102607.log: "912624"
  H:\TECH_LOGS\_FOR_LOKI\rphost_42036\23102607.log: "2112659"
  H:\TECH_LOGS\_FOR_LOKI\rphost_42632\23102607.log: "1864189"
  H:\TECH_LOGS\_FOR_LOKI\rphost_44268\23102607.log: "8234147"
  H:\TECH_LOGS\_FOR_LOKI\rphost_44912\23102607.log: "22538628"
  H:\TECH_LOGS\_FOR_LOKI\rphost_48208\23102608.log: "917115"
  H:\TECH_LOGS\_FOR_LOKI\rphost_50708\23102607.log: "20544"
  H:\TECH_LOGS\_FOR_LOKI\rphost_51088\23102607.log: "5017089"
  H:\TECH_LOGS\_FOR_LOKI\rphost_51572\23102607.log: "9254727"
  H:\TECH_LOGS\_FOR_LOKI\rphost_51572\23102608.log: "195270568"
  H:\TECH_LOGS\_FOR_LOKI\rphost_53512\23102607.log: "6107015"
  H:\TECH_LOGS\_FOR_LOKI\rphost_53600\23102607.log: "58121471"
  H:\TECH_LOGS\_FOR_LOKI\rphost_53600\23102608.log: "113177044"
  H:\TECH_LOGS\_FOR_LOKI\rphost_53700\23102607.log: "3057523"

49. ropots 26.10.23 09:42

(48) So initially it is just empty, and promtail writes its own stopping points there, right?

50. andreysidor4uk 194 26.10.23 10:31

(49) Yes, that's right. Promtail creates and fills this file itself.
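
For anyone who wants to see how far Promtail has read per 1C process, the positions file is easy to summarise. The sketch below assumes the layout shown above (one `path: "offset"` pair per line) and that the number is the read offset in bytes; the helper names are hypothetical and the parsing is done by hand, so no YAML library is needed.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def read_positions(text):
    """Return {log_path: offset} from the text of a positions file."""
    offsets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "positions:":
            continue
        # Split on the last ': ' because Windows paths contain a colon too.
        path, _, raw_offset = line.rpartition(": ")
        offsets[path] = int(raw_offset.strip('"'))
    return offsets


def bytes_per_process(offsets):
    """Sum offsets by parent directory name, e.g. 'rphost_2736'."""
    totals = defaultdict(int)
    for path, offset in offsets.items():
        totals[PureWindowsPath(path).parent.name] += offset
    return dict(totals)
```

Feeding it the contents of `C:\_APP\promtail\positions.yaml` gives one total per process directory, which is a quick way to spot which process produces the bulk of the log volume.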

37. ptica 26.05.23 20:20

Thanks. Very interesting. As I understand it, Loki has labels that apply to the whole file, such as filename, and labels that we can extract from a line, such as event and user. So if we ship logs from one file, every line gets the same filename label but, in general, different event and user values. But is it possible to take a label from one line and apply it to the remaining lines? I have this task for Jenkins log files: at the start of the log there is a line saying which machine the pipeline runs on, and I would really like to turn that line into a label applied to all the other lines, so that the logs can be conveniently filtered by machine name.

38. andreysidor4uk 194 28.05.23 18:53

(37) I actually removed the filename label on purpose, because 1C generates too many unique file names, and the Loki documentation advises against labels with a large number of unique values.
Just last week we needed to add to the log the pid of the process the logs are written for, and it happens to be part of the file path. I didn't add it as a label (for the same reasons as filename); I simply prepended it to every log entry. The promtail configuration:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: C:\_APP\promtail\positions.yaml

clients:
  - url: http://<Loki server IP>:3100/loki/api/v1/push

# Scrape configuration
scrape_configs:
  - job_name: ones-tj
    pipeline_stages:
      - multiline:
          firstline: '^.*[\d]{2}:[\d]{2}.[\d]+-[\d]+,'
          max_wait_time: 3s
          max_lines: 5000
      - regex:
          expression: "^.*(?P<time>[\\d]{2}:[\\d]{2}.[\\d]+)-(?P<duration>[\\d]+?),(?P<event>\\S+?),.*process=(?P<process>\\S+?),"
      - regex:
          expression: "^.*p:processName=(?P<processName>\\S+?),"
      - regex:
          expression: "^.*Usr=(?P<user>.*?),"
      - template:
          source: timestamp
          template: '{{ $val := .filename | regexFind "[1-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9]" }}20{{ $val | substr 0 2 }}-{{ $val | substr 2 4 }}-{{ $val | substr 4 6 }}T{{ $val | substr 6 8 }}:{{ .time }}+03:00'
      - template:
          source: pid
          template: '{{ $val := .filename | regexFind "_[0-9]+" }}{{ $val | regexFind "[0-9]+"}}'
      - template:
          source: message
          template: 'pid={{ .pid }},{{ .Entry }}'
      - labels:
          event:
          user:
      - timestamp:
          format: RFC3339Nano
          source: timestamp
      - output:
          source: message
      - labeldrop:
          - filename
      - metrics:
          event_duration:
            type: Counter
            description: "Event duration from ones tj"
            source: duration
            prefix: promtail_onestj_
            max_idle_duration: 24h
            config:
              action: add

    static_configs:
      - targets:
          - localhost
        labels:
          job: ones-tj
          __path__: H:\TECH_LOGS\_FOR_LOKI\**\*.log

Extracting the pid and prepending it to the log entry is done by these two stages:

      - template:
          source: pid
          template: '{{ $val := .filename | regexFind "_[0-9]+" }}{{ $val | regexFind "[0-9]+"}}'
      - template:
          source: message
          template: 'pid={{ .pid }},{{ .Entry }}'

You need to do roughly the same thing, only change the regular expression and add the new label to labels; adding it to the log line itself is optional. Hope that helps =)
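
To see what those two template stages compute without restarting Promtail, the same logic can be reproduced in a few lines of Python. This is only an illustration of the two `regexFind` calls and the `pid=...,` prefix; the function names are hypothetical, and `re.search` is used because, like `regexFind`, it returns the leftmost match.

```python
import re


def extract_pid(filename):
    """Mirror regexFind "_[0-9]+" followed by regexFind "[0-9]+"."""
    first = re.search(r"_[0-9]+", filename)
    if first is None:
        return ""  # no "_<digits>" part in the path
    return re.search(r"[0-9]+", first.group(0)).group(0)


def prefix_entry(filename, entry):
    """Mirror the template 'pid={{ .pid }},{{ .Entry }}'."""
    return f"pid={extract_pid(filename)},{entry}"
```

For a path such as `H:\TECH_LOGS\_FOR_LOKI\rphost_2736\23102607.log` the first pattern skips `_FOR` and `_LOKI` (no digit follows the underscore) and lands on `_2736`, so the entry is prefixed with `pid=2736,`.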

39. ybuuth 33 20.06.23 08:52

Awesome post. How are Loki and Prometheus connected? On your diagram they are linked by arrows, but I couldn't see the connection from the text.
I took your Promtail settings, but it refused to start until I changed timestamp to another variable (time); maybe this helps someone.

      - template:
          source: time
          template: '{{ $val := .filename | regexFind "[1-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9]" }}20{{ $val | substr 0 2 }}-{{ $val | substr 2 4 }}-{{ $val | substr 4 6 }}T{{ $val | substr 6 8 }}:{{ .time }}+03:00'
      - labels:
          event:
          user:
      - timestamp:
          format: RFC3339Nano
          source: time
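
The timestamp template is dense, so here is a rough Python equivalent of what it assembles: the date and hour come from the `YYMMDDHH` log file name, the minutes and seconds come from the `time` group captured from the log line, and the `+03:00` offset is hard-coded exactly as in the template. The function name is hypothetical.

```python
import re

# Same pattern as in the template: YYMMDDHH taken from the log file name.
HOUR_FILE = re.compile(r"[1-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9]")


def build_timestamp(filename, minute_second, utc_offset="+03:00"):
    """Combine the file name (date and hour) with 'MM:SS.ffffff' from the line."""
    stamp = HOUR_FILE.search(filename)
    if stamp is None:
        raise ValueError(f"no YYMMDDHH part found in {filename!r}")
    val = stamp.group(0)
    return f"20{val[0:2]}-{val[2:4]}-{val[4:6]}T{val[6:8]}:{minute_second}{utc_offset}"
```

For example, `build_timestamp(r"H:\TECH_LOGS\_FOR_LOKI\rphost_2736\23102607.log", "15:02.123456")` yields `2023-10-26T07:15:02.123456+03:00`, which is the RFC 3339 form the `timestamp` stage then parses.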

40. andreysidor4uk 194 20.06.23 20:19

(39) Thanks. Loki exposes metrics just like the other services do. Prometheus collects those metrics, and that is the link shown on the diagram. I didn't mention it in the text because it is not a very important detail, but the connection does exist, which is why it is on the diagram =)

43. kuza_87 28 08.08.23 12:39

Many thanks for an interesting and informative article. I'll be setting up monitoring soon too and will make use of your work. Could you tell me what the specs of the Debian server are for all your services (memory/CPU/cores)?

44. andreysidor4uk 194 15.08.23 21:01

(43) The specs are in the photo. The RAM is even over-provisioned; I think 4-5 GB would have been more than enough. The disk is 200 GB only because logs and collected metrics are kept for 30 days rather than 2 weeks (the default).

Attached files:

45. ybuuth 33 23.09.23 22:08

Andrey, it seems to me that in your compose.yaml Grafana should have a line that copies the data sources you defined in the grafana/provisioning/datasources directory

volumes:
    # Grafana's DB files
      - grafana_data:/var/lib/grafana
    # auto-provisioning of the prometheus and loki data sources
      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources

I didn't see it in yours, so how do you copy them? I'm still getting to grips with Docker, so this may be a naive question. I'd be grateful for help.

46. andreysidor4uk 194 06.10.23 15:22

(45) Good afternoon. Maybe I misunderstood the question, but the grafana/conf/provisioning/datasources directory is handled by this line:

- ./grafana/conf:/etc/grafana

All nested directories are mounted along with it automatically.
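
The mapping can be made explicit with a small path calculation. This sketch does not talk to Docker at all; it only shows how a host path under a bind-mounted directory translates into the path the container sees, and the function name is made up for illustration.

```python
from pathlib import PurePosixPath


def container_path(host_path, mount_source, mount_target):
    """Translate a host path under a bind mount into its in-container path."""
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(mount_source))
    return str(PurePosixPath(mount_target) / relative)
```

With the mount `./grafana/conf:/etc/grafana`, the host directory `./grafana/conf/provisioning/datasources` comes out as `/etc/grafana/provisioning/datasources`, which is the same target the separate volume line in comment 45 points at.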

Author:

Andrey Sidorchuk (andreysidor4uk)

Rating: 194

Publication:

No. 1859181

Created 10.05.23 07:00

Updated 10.05.23 07:00

Statistics:

Views 15144

Downloads 16

Rating 146

Comments 49

Characteristics:

Open source: Yes

Categories: Monitoring

Audience: System administrator, Programmer

File type: Data archive

Platform: Not relevant

Configuration: Not relevant

Operating system: Not relevant

Country: Russia

Industry: Not relevant

Taxes: Not relevant

Accounting type: Not relevant

File access: Subscription ($m)