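AWS CloudWatch is a service that monitors AWS resources and the applications running on AWS in real time. CloudWatch can send alarm messages through AWS SNS. You only need to configure the URL of the Log Service alert ingestion endpoint in AWS SNS. CloudWatch alarm messages are then delivered to Log Service, and the Log Service alerting system handles denoising, notification, and other processing.
Prerequisites
An alert ingestion application whose protocol is CloudWatch has been created. For details, see Configure the external endpoint for alert ingestion.
Configure CloudWatch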
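1. Log on to the AWS Management Console.
2. Create an SNS topic. In the Amazon SNS console, set the following required parameters. For details, see Creating an Amazon SNS topic.
Parameter | Description |
---|---|
Type | The topic type. Select Standard. |
Name | The topic name. |
3. Subscribe to the SNS topic. In the Amazon SNS console, set the following required parameters. For details, see Subscribing to an Amazon SNS topic.
Parameter | Description |
---|---|
Topic ARN | The ARN of the topic that you created in step 2. |
Protocol | The protocol. Select HTTP. |
Endpoint | The endpoint information (full URL) generated after you create the alert ingestion service and application in Log Service. To obtain it, see Obtain endpoint information. |
Enable raw message delivery | Select the Enable raw message delivery check box. |
After the configuration is complete, the subscription is in the Pending confirmation state. AWS SNS then sends a subscription confirmation message to Log Service, and Log Service automatically visits the confirmation link in that message. After the link is visited successfully, the subscription changes to the Confirmed state, which indicates that the subscription succeeded.
Note: If the subscription does not succeed, select the subscription and click Request confirmation to resend the confirmation message. If it still does not succeed, check the error logs in the Log Service alert troubleshooting center.
4. Select the alarm that you want to ingest into Log Service and add notifications. On the edit page of the target alarm in the CloudWatch console, add two notifications as described below. For details, see To edit an alarm.
  - Alarm state trigger: Select the alarm state that triggers the notification.
    - For one notification, set Alarm state trigger to In alarm or Insufficient data, so that an alert notification is sent when the alarm enters that state.
    - For the other notification, set Alarm state trigger to OK, so that a recovery notification is sent when the alarm recovers.
  - Select an SNS topic: Select Select an existing SNS topic.
  - Send a notification to…: Select the topic that you created in step 2.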
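For readers who script this setup instead of using the consoles, the subscription and notification settings above can be sketched as plain parameter dictionaries. This is only a minimal sketch and not part of the original procedure: the helper names `build_subscription_params` and `build_alarm_action_params` are hypothetical, the ARN and URL values are placeholders, and the dictionary keys follow the parameter names of the AWS SNS Subscribe and CloudWatch PutMetricAlarm APIs.

```python
from typing import Any, Dict

# Maps the alarm state chosen for the first notification to the
# PutMetricAlarm parameter that carries its action list.
_STATE_TO_ACTION_KEY = {
    "ALARM": "AlarmActions",
    "INSUFFICIENT_DATA": "InsufficientDataActions",
}


def build_subscription_params(topic_arn: str, endpoint_url: str) -> Dict[str, Any]:
    """Arguments for an SNS Subscribe call that mirror step 3 (hypothetical helper)."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "http",  # step 3 says to select HTTP
        "Endpoint": endpoint_url,  # full URL generated by Log Service
        "Attributes": {"RawMessageDelivery": "true"},  # the raw message delivery check box
    }


def build_alarm_action_params(topic_arn: str, firing_state: str = "ALARM") -> Dict[str, Any]:
    """Action lists for the two notifications in step 4 (hypothetical helper)."""
    if firing_state not in _STATE_TO_ACTION_KEY:
        raise ValueError("firing_state must be ALARM or INSUFFICIENT_DATA")
    return {
        _STATE_TO_ACTION_KEY[firing_state]: [topic_arn],  # alert notification
        "OKActions": [topic_arn],  # recovery notification
    }


if __name__ == "__main__":
    arn = "arn:aws:sns:us-east-2:123456:my-topic"  # placeholder ARN
    print(build_subscription_params(arn, "http://example.invalid/webhook"))
    print(build_alarm_action_params(arn))
```

With boto3, the first dictionary could be passed as `boto3.client("sns").subscribe(**params)`, and the second merged into the remaining arguments of `boto3.client("cloudwatch").put_metric_alarm(...)`, which also needs the full alarm definition.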
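CloudWatch alarm messages
CloudWatch alarms are either static threshold alarms or anomaly detection alarms. The two kinds of alarm message differ in the value of the Trigger field. For more information, see the CloudWatch::Alarm property reference.
- In a static threshold alarm message, the Trigger value contains fields such as MetricName and Dimensions.
- In an anomaly detection alarm message, the Trigger value contains fields such as Metrics, whose value is a list of metric data queries.
- Static threshold alarm message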
{
  "AlarmName": "test-alert",
  "AlarmDescription": "this is a test alert",
  "AWSAccountId": "123456",
  "NewStateValue": "ALARM",
  "NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (04/08/21 03:06:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).",
  "StateChangeTime": "2021-08-04T03:10:10.215+0000",
  "Region": "US East (Ohio)",
  "AlarmArn": "arn:aws:cloudwatch:us-east-2:123456:alarm:test-alert",
  "OldStateValue": "OK",
  "Trigger": {
    "MetricName": "NumberOfMessagesPublished",
    "Namespace": "AWS/SNS",
    "StatisticType": "Statistic",
    "Statistic": "SUM",
    "Unit": null,
    "Dimensions": [
      { "value": "my-topic", "name": "TopicName" }
    ],
    "Period": 60,
    "EvaluationPeriods": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "Threshold": 1.0,
    "TreatMissingData": "- TreatMissingData: missing",
    "EvaluateLowSampleCountPercentile": ""
  }
}
- Anomaly detection alarm message
{
  "AlarmName": "cpu alrm",
  "AlarmDescription": "this is a cpu alarm",
  "AWSAccountId": "123456",
  "NewStateValue": "INSUFFICIENT_DATA",
  "NewStateReason": "Threshold Crossed: no datapoints were received for 2 periods and 2 missing datapoints were treated as [Breaching].",
  "StateChangeTime": "2021-08-05T08:38:47.104+0000",
  "Region": "US East (Ohio)",
  "AlarmArn": "arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm",
  "OldStateValue": "OK",
  "Trigger": {
    "Period": 60,
    "EvaluationPeriods": 2,
    "ComparisonOperator": "GreaterThanUpperThreshold",
    "ThresholdMetricId": "ad1",
    "TreatMissingData": "- TreatMissingData: breaching",
    "EvaluateLowSampleCountPercentile": "",
    "Metrics": [
      {
        "Id": "m1",
        "MetricStat": {
          "Metric": {
            "Dimensions": [
              { "value": "i-1a2b3c4d", "name": "InstanceId" }
            ],
            "MetricName": "CPUUtilization",
            "Namespace": "AWS/EC2"
          },
          "Period": 60,
          "Stat": "Average"
        },
        "ReturnData": true
      },
      {
        "Expression": "ANOMALY_DETECTION_BAND(m1, 0.1)",
        "Id": "ad1",
        "Label": "CPUUtilization (预期)",
        "ReturnData": true
      }
    ]
  }
}
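The distinction between the two message kinds can be sketched as a small check on the Trigger field. This is a minimal sketch that follows the rule stated in this section (MetricName for static threshold alarms, Metrics for anomaly detection alarms). The function name `alarm_kind` is hypothetical, the sample messages are trimmed copies of the examples above, and this is not necessarily how Log Service itself classifies messages.

```python
import json
from typing import Any, Dict


def alarm_kind(message: Dict[str, Any]) -> str:
    """Classify a CloudWatch alarm message by the shape of its Trigger field."""
    trigger = message.get("Trigger") or {}
    if "Metrics" in trigger:
        # A list of metric data queries: treated here as anomaly detection.
        return "AnomalyDetection"
    if "MetricName" in trigger:
        # A single metric with dimensions: treated here as a static threshold.
        return "StaticThreshold"
    return "Unknown"


if __name__ == "__main__":
    static_raw = (
        '{"AlarmName": "test-alert", "Trigger": {"MetricName": '
        '"NumberOfMessagesPublished", "Dimensions": '
        '[{"value": "my-topic", "name": "TopicName"}]}}'
    )
    anomaly_raw = (
        '{"AlarmName": "cpu alrm", "Trigger": {"ThresholdMetricId": "ad1", '
        '"Metrics": [{"Id": "ad1", "Expression": '
        '"ANOMALY_DETECTION_BAND(m1, 0.1)", "ReturnData": true}]}}'
    )
    print(alarm_kind(json.loads(static_raw)))
    print(alarm_kind(json.loads(anomaly_raw)))
```

The returned strings reuse the values of the `__cloud_watch_alert_type__` annotation that appears in the mapped examples in this document.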
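Alarm message mapping
After a CloudWatch alarm is ingested into Log Service, it is mapped to Log Service alert content. Examples:
- Static threshold alarm message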
{
  "aliuid": "aliuid1",
  "alert_instance_id": "{auto-generated}",
  "alert_id": "CloudWatch_test-alert",
  "alert_type": "sls_pub",
  "alert_name": "test-alert",
  "region": "{region of the alert center project}",
  "project": "{project that the alert center belongs to}",
  "project_id": 0,
  "next_eval_interval": 60,
  "alert_time": 1628046610,
  "fire_time": 1628046610,
  "fire_results": null,
  "fire_results_count": 0,
  "resolve_time": 0,
  "status": "firing",
  "results": null,
  "labels": {
    "TopicName": "my-topic",
    "__comparison_operator__": "GreaterThanOrEqualToThreshold",
    "__statistic__": "SUM",
    "__statistic_type__": "Statistic",
    "__threshold__": "1",
    "metric_name": "NumberOfMessagesPublished"
  },
  "annotations": {
    "__alarm_arn__": "arn:aws:cloudwatch:us-east-2:123456:alarm:test-alert",
    "__aws_accountId__": "123456",
    "__aws_region__": "US East (Ohio)",
    "__cloud_watch_alert_type__": "StaticThreshold",
    "__config_app__": "sls_pub_alert",
    "__pub_alert_app__": "{alert ingestion application ID}",
    "__pub_alert_protocol__": "cloud_watch",
    "__pub_alert_region__": "{region of the endpoint that receives the alarm message}",
    "__pub_alert_service__": "{alert ingestion service ID}",
    "desc": "this is a test alert",
    "title": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (04/08/21 03:06:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)."
  },
  "severity": 10,
  "policy": {
    "alert_policy_id": "{alert policy ID configured in the alert ingestion application}",
    "action_policy_id": "{action policy ID configured in the alert ingestion application}",
    "use_default": false,
    "repeat_interval": "{repeat interval configured in the alert ingestion application}"
  },
  "template": null,
  "drill_down_query": "https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#alarmsV2:alarm/test-alert"
}
- Anomaly detection alarm message
{
  "aliuid": "aliuid1",
  "alert_instance_id": "{auto-generated}",
  "alert_id": "CloudWatch_cpu alrm",
  "alert_type": "sls_pub",
  "alert_name": "cpu alrm",
  "region": "{region of the alert center project}",
  "project": "{project that the alert center belongs to}",
  "project_id": 0,
  "next_eval_interval": 120,
  "alert_time": 1628152727,
  "fire_time": 1628152727,
  "fire_results": null,
  "fire_results_count": 0,
  "resolve_time": 0,
  "status": "firing",
  "results": null,
  "labels": {
    "__comparison_operator__": "GreaterThanUpperThreshold",
    "__threshold_metricId__": "ad1"
  },
  "annotations": {
    "__alarm_arn__": "arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm",
    "__aws_accountId__": "123456",
    "__aws_region__": "US East (Ohio)",
    "__cloud_watch_alert_type__": "AnomalyDetection",
    "__config_app__": "sls_pub_alert",
    "__pub_alert_app__": "{alert ingestion application ID}",
    "__pub_alert_protocol__": "cloud_watch",
    "__pub_alert_region__": "{region of the endpoint that receives the alarm message}",
    "__pub_alert_service__": "{alert ingestion service ID}",
    "desc": "this is a cpu alarm",
    "title": "Threshold Crossed: no datapoints were received for 2 periods and 2 missing datapoints were treated as [Breaching]."
  },
  "severity": 8,
  "policy": {
    "alert_policy_id": "{alert policy ID configured in the alert ingestion application}",
    "action_policy_id": "{action policy ID configured in the alert ingestion application}",
    "use_default": false,
    "repeat_interval": "{repeat interval configured in the alert ingestion application}"
  },
  "template": null,
  "drill_down_query": "https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#alarmsV2:alarm/cpu%20alrm"
}
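The fields of a Log Service alert map to the fields of a CloudWatch alarm message as follows:
Log Service field | CloudWatch field | Description |
---|---|---|
aliuid | None | The ID of the Alibaba Cloud account that owns the alert ingestion application used to ingest the alarm. |
alert_id | None | The ID of the alert monitoring rule. The value is CloudWatch_{$alert_name}, where {$alert_name} is the name of the alert monitoring rule. |
alert_type | None | The alert type. Fixed to sls_pub. |
alert_name | AlarmName | The name of the alert monitoring rule. |
status | NewStateValue | The alert status. In the examples above, both ALARM and INSUFFICIENT_DATA map to firing. |
next_eval_interval | Period, EvaluationPeriods | The alert evaluation interval: the product of the Period and EvaluationPeriods values in the CloudWatch alarm message. |
alert_time | StateChangeTime | The time when the alert was triggered. |
fire_time | StateChangeTime | The time when the alert was first triggered. |
resolve_time | StateChangeTime | The time when the alert was resolved. In the firing examples above, the value is 0. |
labels | None | Label information. In the examples above, the labels carry values from the Trigger field, such as metric_name, the dimension name and value, and __comparison_operator__. |
annotations | None | Annotation information. Log Service adds extra fields to annotations. In the examples above, these include __alarm_arn__, __aws_accountId__, __aws_region__, __cloud_watch_alert_type__, desc (from AlarmDescription), and title (from NewStateReason). |
severity | NewStateValue | The alert severity. In the examples above, ALARM maps to 10 and INSUFFICIENT_DATA maps to 8. |
policy | None | The alert policy that you configured in the alert ingestion application. For more information, see Policy structure. |
project | None | The project that the alert center belongs to. For more information, see Project. |
drill_down_query | None | The URL of the corresponding CloudWatch alarm. |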
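Several rows of the table can be checked against the two example messages with a short script. This is a minimal sketch of the mapping as it can be read from this page, not Log Service's actual implementation. The function name `sketch_mapped_fields` is hypothetical, the timestamp format string is inferred from the sample StateChangeTime values, and the drill_down_query pattern (region taken from the ARN, alarm name URL-encoded) is inferred from the two example URLs.

```python
from datetime import datetime
from typing import Any, Dict
from urllib.parse import quote


def sketch_mapped_fields(msg: Dict[str, Any]) -> Dict[str, Any]:
    """Derive a few Log Service alert fields from a CloudWatch alarm message."""
    trigger = msg["Trigger"]
    name = msg["AlarmName"]
    # StateChangeTime looks like 2021-08-04T03:10:10.215+0000 in the samples.
    changed = datetime.strptime(msg["StateChangeTime"], "%Y-%m-%dT%H:%M:%S.%f%z")
    # The fourth ARN segment is the region ID, for example us-east-2.
    region_id = msg["AlarmArn"].split(":")[3]
    url = (
        "https://" + region_id + ".console.aws.amazon.com/cloudwatch/home"
        "?region=" + region_id + "#alarmsV2:alarm/" + quote(name)
    )
    return {
        "alert_id": "CloudWatch_" + name,
        "alert_type": "sls_pub",
        "alert_name": name,
        # Product of Period and EvaluationPeriods, as the table states.
        "next_eval_interval": trigger["Period"] * trigger["EvaluationPeriods"],
        # Whole seconds since the Unix epoch.
        "alert_time": int(changed.timestamp()),
        "drill_down_query": url,
    }


if __name__ == "__main__":
    sample = {
        "AlarmName": "cpu alrm",
        "StateChangeTime": "2021-08-05T08:38:47.104+0000",
        "AlarmArn": "arn:aws:cloudwatch:us-east-2:123456:alarm:cpu alrm",
        "Trigger": {"Period": 60, "EvaluationPeriods": 2},
    }
    for key, value in sketch_mapped_fields(sample).items():
        print(key, "=", value)
```

For the anomaly detection sample this yields next_eval_interval 120 and alert_time 1628152727, which agree with the mapped example shown earlier.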