Reading Component Temperatures and Checking Power Supply Status with ipmitool

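Introduction

ipmitool can read the temperature of each component in a server. How do we interpret those readings? Is each component's temperature OK, or is some component too hot or too cold and in need of an alert?

How to Read Component Temperatures

The following command lists the reading and status of every sensor, including all temperature sensors: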

root@node:~# ipmitool sensor list 
CPU1 Temp | 29.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 85.000 | 90.000 | 90.000
CPU2 Temp | 33.000 | degrees C | nr | 10.000 | 10.000 | 10.000 | 30.000 | 30.000 | 30.000
PCH Temp | 32.000 | degrees C | ok | 0.000 | 5.000 | 16.000 | 90.000 | 95.000 | 100.000
System Temp | 30.000 | degrees C | ok | -10.000 | -5.000 | 0.000 | 80.000 | 85.000 | 90.000
Peripheral Temp | 34.000 | degrees C | ok | -10.000 | -5.000 | 0.000 | 80.000 | 85.000 | 90.000
Vcpu1VRM Temp | 28.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 95.000 | 100.000 | 105.000
Vcpu2VRM Temp | 34.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 95.000 | 100.000 | 105.000
VmemABVRM Temp | 29.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 95.000 | 100.000 | 105.000
VmemCDVRM Temp | 28.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 95.000 | 100.000 | 105.000
VmemEFVRM Temp | 31.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 95.000 | 100.000 | 105.000
VmemGHVRM Temp | 30.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 95.000 | 100.000 | 105.000
P1-DIMMA1 Temp | 27.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 80.000 | 85.000 | 90.000
P1-DIMMA2 Temp | na | | na | na | na | na | na | na | na
P1-DIMMB1 Temp | 27.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 80.000 | 85.000 | 90.000
P1-DIMMB2 Temp | na | | na | na | na | na | na | na | na
P1-DIMMC1 Temp | na | | na | na | na | na | na | na | na
P1-DIMMC2 Temp | na | | na | na | na | na | na | na | na
P1-DIMMD1 Temp | na | | na | na | na | na | na | na | na
P1-DIMMD2 Temp | na | | na | na | na | na | na | na | na
P2-DIMME1 Temp | 29.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 80.000 | 85.000 | 90.000
P2-DIMME2 Temp | na | | na | na | na | na | na | na | na
P2-DIMMF1 Temp | 30.000 | degrees C | ok | -5.000 | 0.000 | 5.000 | 80.000 | 85.000 | 90.000
P2-DIMMF2 Temp | na | | na | na | na | na | na | na | na
P2-DIMMG1 Temp | na | | na | na | na | na | na | na | na
P2-DIMMG2 Temp | na | | na | na | na | na | na | na | na
P2-DIMMH1 Temp | na | | na | na | na | na | na | na | na
P2-DIMMH2 Temp | na | | na | na | na | na | na | na | na
FAN1 | 4400.000 | RPM | ok | 300.000 | 500.000 | 700.000 | 25300.000 | 25400.000 | 25500.000
FAN2 | 4300.000 | RPM | ok | 300.000 | 500.000 | 700.000 | 25300.000 | 25400.000 | 25500.000
FAN3 | 4400.000 | RPM | ok | 300.000 | 500.000 | 700.000 | 25300.000 | 25400.000 | 25500.000
FAN4 | na | | na | na | na | na | na | na | na
FAN5 | na | | na | na | na | na | na | na | na
FAN6 | na | | na | na | na | na | na | na | na
FANA | 4400.000 | RPM | ok | 300.000 | 500.000 | 700.000 | 25300.000 | 25400.000 | 25500.000
FANB | na | | na | na | na | na | na | na | na
12V | 12.315 | Volts | ok | 10.173 | 10.299 | 10.740 | 12.945 | 13.260 | 13.386
5VCC | 5.000 | Volts | ok | 4.246 | 4.298 | 4.480 | 5.390 | 5.546 | 5.598
3.3VCC | 3.316 | Volts | ok | 2.789 | 2.823 | 2.959 | 3.554 | 3.656 | 3.690
VBAT | 3.104 | Volts | ok | 2.376 | 2.480 | 2.584 | 3.494 | 3.598 | 3.676
Vcpu1 | 1.800 | Volts | ok | 1.242 | 1.260 | 1.395 | 1.899 | 2.088 | 2.106
Vcpu2 | 1.809 | Volts | ok | 1.242 | 1.260 | 1.395 | 1.899 | 2.088 | 2.106
VDIMMAB | 1.200 | Volts | ok | 0.948 | 0.975 | 1.047 | 1.344 | 1.425 | 1.443
VDIMMCD | 1.209 | Volts | ok | 0.948 | 0.975 | 1.047 | 1.344 | 1.425 | 1.443
VDIMMEF | 1.209 | Volts | ok | 0.948 | 0.975 | 1.047 | 1.344 | 1.425 | 1.443
VDIMMGH | 1.209 | Volts | ok | 0.948 | 0.975 | 1.047 | 1.344 | 1.425 | 1.443
5VSB | 4.974 | Volts | ok | 4.246 | 4.298 | 4.480 | 5.390 | 5.546 | 5.598
3.3VSB | 3.316 | Volts | ok | 2.789 | 2.823 | 2.959 | 3.554 | 3.656 | 3.690
1.5V PCH | 1.509 | Volts | ok | 1.320 | 1.347 | 1.401 | 1.644 | 1.671 | 1.698
1.2V BMC | 1.209 | Volts | ok | 1.020 | 1.047 | 1.092 | 1.344 | 1.371 | 1.398
1.05V PCH | 1.050 | Volts | ok | 0.870 | 0.897 | 0.942 | 1.194 | 1.221 | 1.248
Chassis Intru | 0x0 | discrete | 0x0000| na | na | na | na | na | na
PS1 Status | 0x1 | discrete | 0x0100| na | na | na | na | na | na
PS2 Status | 0x1 | discrete | 0x0100| na | na | na | na | na | na
AOC_SAS Temp | 60.000 | degrees C | ok | -11.000 | -8.000 | -5.000 | 100.000 | 105.000 | 110.000
HDD Temp | 29.000 | degrees C | ok | -11.000 | -8.000 | -5.000 | 50.000 | 55.000 | 60.000
HDD Status | 0x1 | discrete | 0x01ff| na | na | na | na | na | na

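Generally, a row whose third column contains "degrees C" is a temperature sensor.

  • Column 1: the sensor name, for example CPU1 Temp.
  • Column 2: the component's current temperature. Note that it is sometimes na, meaning the value could not be read.
  • Column 4: the temperature status. ok means the temperature is normal; nr means non-recoverable.
  • Columns 5 to 10: the thresholds, in the order lower non-recoverable, lower critical, lower non-critical, upper non-critical, upper critical, upper non-recoverable.

The common temperature states are the following five:

  • ok: the temperature is normal.
  • nc: non-critical. The temperature is somewhat high (or low), but not seriously so.
  • cr: critical. The temperature is too high or too low, and the situation is serious.
  • nr: non-recoverable. The temperature is so high or so low that it may cause permanent damage.
  • na: the temperature state is unknown. This is relatively rare.

    Note that ok -> nc -> cr -> nr runs from normal to increasingly severe temperature problems.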
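The column layout described above can be turned into a small filter that reports temperature sensors whose status is not ok. This is only a minimal sketch: the names `SensorRow`, `parse_sensor_line` and `abnormal_temperatures` are hypothetical helpers, not part of ipmitool, and the sample rows are copied from the listing above.

```python
from typing import List, NamedTuple, Optional


class SensorRow(NamedTuple):
    name: str
    reading: Optional[float]  # None when ipmitool prints "na" or a discrete value
    unit: str
    status: str


def parse_sensor_line(line: str) -> Optional[SensorRow]:
    """Parse one row of `ipmitool sensor list`; return None for malformed rows."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 4:
        return None
    name, value, unit, status = fields[:4]
    try:
        reading = float(value)
    except ValueError:  # "na", or a discrete value such as "0x1"
        reading = None
    return SensorRow(name, reading, unit, status)


def abnormal_temperatures(output: str) -> List[SensorRow]:
    """Return the temperature rows (unit "degrees C") whose status is not ok."""
    rows = (parse_sensor_line(line) for line in output.splitlines())
    return [r for r in rows if r and r.unit == "degrees C" and r.status != "ok"]


# Sample rows copied from the listing above.
SAMPLE = """\
CPU1 Temp | 29.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 85.000 | 90.000 | 90.000
CPU2 Temp | 33.000 | degrees C | nr | 10.000 | 10.000 | 10.000 | 30.000 | 30.000 | 30.000
P1-DIMMA2 Temp | na | | na | na | na | na | na | na | na
FAN1 | 4400.000 | RPM | ok | 300.000 | 500.000 | 700.000 | 25300.000 | 25400.000 | 25500.000
PS1 Status | 0x1 | discrete | 0x0100| na | na | na | na | na | na
"""

if __name__ == "__main__":
    for row in abnormal_temperatures(SAMPLE):
        print(row.name, row.reading, row.status)
```

Rows whose reading is na have an empty unit column, so the unit test also skips unpopulated DIMM slots without special handling.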

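How to Trigger a Temperature Alert

The three states nc, cr and nr were all described as "too high" or "too low", so:

  • At what temperature does the state become nc?
  • At what temperature does it become cr?
  • At what temperature does it become nr?

Each component's state changes are driven by temperature thresholds, which can be viewed as follows: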

root@node:~# ipmitool sensor get "CPU1 Temp"
Locating sensor record...
Sensor ID : CPU1 Temp (0x1)
Entity ID : 3.1 (Processor)
Sensor Type (Threshold) : Temperature (0x01)
Sensor Reading : 29 (+/- 0) degrees C
Status : ok
Nominal Reading : 40.000
Normal Minimum : -4.000
Normal Maximum : 89.000
Upper non-recoverable : 90.000
Upper critical : 90.000
Upper non-critical : 85.000
Lower non-recoverable : 0.000
Lower critical : 0.000
Lower non-critical : 0.000
Positive Hysteresis : 2.000
Negative Hysteresis : 2.000
Minimum sensor range : Unspecified
Maximum sensor range : Unspecified
Event Message Control : Per-threshold
Readable Thresholds : lnr lcr lnc unc ucr unr
Settable Thresholds : lnr lcr lnc unc ucr unr
Threshold Read Mask : lnr lcr lnc unc ucr unr
Assertion Events :
Assertions Enabled : ucr+
Deassertions Enabled : ucr+

From the output above we can see:

  • Upper non-critical: 85 degrees
  • Upper critical: 90 degrees
  • Upper non-recoverable: 90 degrees
  • Lower non-critical: 0 degrees
  • Lower critical: 0 degrees
  • Lower non-recoverable: 0 degrees
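The six thresholds listed above can also be pulled out of the `ipmitool sensor get` text programmatically. The following is a rough sketch; `parse_thresholds` and `THRESHOLD_LABELS` are hypothetical names, and the abbreviations are the ones ipmitool prints on its "Readable Thresholds" line.

```python
from typing import Dict

# Map the labels printed by `ipmitool sensor get` to ipmitool's abbreviations.
THRESHOLD_LABELS = {
    "Lower non-recoverable": "lnr",
    "Lower critical": "lcr",
    "Lower non-critical": "lnc",
    "Upper non-critical": "unc",
    "Upper critical": "ucr",
    "Upper non-recoverable": "unr",
}


def parse_thresholds(text: str) -> Dict[str, float]:
    """Extract the threshold values from `ipmitool sensor get` output."""
    result: Dict[str, float] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        abbrev = THRESHOLD_LABELS.get(key.strip())
        if not sep or abbrev is None:
            continue  # not a threshold line
        try:
            result[abbrev] = float(value.strip())
        except ValueError:
            pass  # threshold not readable, e.g. "na"
    return result


# Excerpt of the CPU1 Temp output shown above.
SAMPLE = """\
Sensor ID : CPU1 Temp (0x1)
Upper non-recoverable : 90.000
Upper critical : 90.000
Upper non-critical : 85.000
Lower non-recoverable : 0.000
Lower critical : 0.000
Lower non-critical : 0.000
Readable Thresholds : lnr lcr lnc unc ucr unr
"""

if __name__ == "__main__":
    print(parse_thresholds(SAMPLE))
```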

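With the thresholds known, working out the state is straightforward:

  • (0, 85): the state is ok
  • [85, 90): the state is nc
  • [90, +∞): the state is nr (the cr and nr thresholds are both 90, so the more severe state, nr, applies)

The low-temperature case works the same way.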
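The rule above can be written as a short function. This is a sketch of the comparison only: the real state is computed by the BMC, hysteresis is ignored here, and the at-or-above / at-or-below comparisons are an assumption based on how the IPMI specification describes threshold status. `Thresholds` and `classify` are hypothetical names.

```python
from typing import NamedTuple


class Thresholds(NamedTuple):
    lnr: float  # lower non-recoverable
    lcr: float  # lower critical
    lnc: float  # lower non-critical
    unc: float  # upper non-critical
    ucr: float  # upper critical
    unr: float  # upper non-recoverable


def classify(reading: float, t: Thresholds) -> str:
    """Return ok, nc, cr or nr, checking the most severe state first."""
    if reading >= t.unr or reading <= t.lnr:
        return "nr"
    if reading >= t.ucr or reading <= t.lcr:
        return "cr"
    if reading >= t.unc or reading <= t.lnc:
        return "nc"
    return "ok"


# Thresholds of CPU1 Temp as shown in the output above.
CPU1 = Thresholds(lnr=0, lcr=0, lnc=0, unc=85, ucr=90, unr=90)

if __name__ == "__main__":
    for temp in (29, 87, 90):
        print(temp, classify(temp, CPU1))
```

Checking the most severe state first is what makes a reading of 90 come out as nr rather than cr when both thresholds are equal.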

How can we force the temperature state into an alert state, that is, nc, cr or nr?

ipmitool provides a way to set the threshold for each state.

ipmitool -I open sensor thresh 'CPU2 Temp' upper 20 30 90

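The command above sets the upper alert thresholds of the CPU2 Temp sensor to 20, 30 and 90 (upper non-critical, upper critical and upper non-recoverable, in that order).

Because the CPU2 temperature is around 33 degrees, the following command changes its state to nc: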

root@node:~# ipmitool -I open sensor thresh 'CPU2 Temp' upper 20 40 90
Locating sensor record 'CPU2 Temp'...
Setting sensor "CPU2 Temp" Upper Non-Critical threshold to 20.000
Setting sensor "CPU2 Temp" Upper Critical threshold to 40.000
Setting sensor "CPU2 Temp" Upper Non-Recoverable threshold to 90.000

root@node:~# ipmitool sensor list
CPU2 Temp | 33.000 | degrees C | nc | 10.000 | 10.000 | 10.000 | 20.000 | 40.000 | 90.000

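33 degrees Celsius is above 20 but still below 40, so the state is nc, i.e. non-critical.

By the same reasoning, if we set the alert thresholds to 20 30 90, the state becomes cr, i.e. critical: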

root@node:~# ipmitool -I open sensor thresh 'CPU2 Temp' upper 20 30 90
Locating sensor record 'CPU2 Temp'...
Setting sensor "CPU2 Temp" Upper Non-Critical threshold to 20.000
Setting sensor "CPU2 Temp" Upper Critical threshold to 30.000
Setting sensor "CPU2 Temp" Upper Non-Recoverable threshold to 90.000

root@node:~# ipmitool sensor list
CPU1 Temp | 30.000 | degrees C | ok | 0.000 | 0.000 | 0.000 | 85.000 | 90.000 | 90.000
CPU2 Temp | 33.000 | degrees C | cr | 10.000 | 10.000 | 10.000 | 20.000 | 30.000 | 90.000

Likewise, to change the state to nr, simply set the thresholds to 20 30 30.
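When scripting these experiments, it can help to build the command as an argument list so that the sensor name, which contains a space, needs no shell quoting. The sketch below assumes the same `ipmitool -I open sensor thresh ... upper` form used above. `build_upper_thresh_cmd` and `apply_upper_thresholds` are hypothetical helper names, and the ordering check is a sanity check added for illustration, not something the article states ipmitool requires.

```python
import subprocess
from typing import List


def build_upper_thresh_cmd(sensor: str, unc: float, ucr: float, unr: float) -> List[str]:
    """Build the argv for setting the three upper thresholds of one sensor."""
    # Illustrative sanity check: thresholds should not decrease with severity.
    if not (unc <= ucr <= unr):
        raise ValueError("expected unc <= ucr <= unr")
    return [
        "ipmitool", "-I", "open", "sensor", "thresh", sensor, "upper",
        f"{unc:g}", f"{ucr:g}", f"{unr:g}",
    ]


def apply_upper_thresholds(sensor: str, unc: float, ucr: float, unr: float) -> None:
    """Run the command; needs root and a local BMC, so it is not called here."""
    subprocess.run(build_upper_thresh_cmd(sensor, unc, ucr, unr), check=True)


if __name__ == "__main__":
    # Thresholds that put a 33-degree reading into the nr state.
    print(" ".join(build_upper_thresh_cmd("CPU2 Temp", 20, 30, 30)))
```

After a test like this, the original thresholds (85 90 90 for CPU1 Temp in the output above) would need to be restored the same way.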

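Checking Power Supply Module Status

IPMI is powerful, but how do we use ipmitool to read the real-time state of the power supplies? Modern servers generally have two power supply modules for redundancy. How do we check their status: are all modules installed, and is each one receiving power?

Method 1

The following command returns the power supply status: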

ipmitool sdr type "power supply"

Under normal conditions the power supply status looks like this:

PS1 Status       | C4h | ok  | 10.1 | Presence detected
PS2 Status | C5h | ok | 10.2 | Presence detected

If we unplug the power cord from one of the modules, the status looks like this:

PS1 Status       | C4h | ok  | 10.1 | Presence detected
PS2 Status | C5h | ok | 10.2 | Presence detected, Failure detected, Power Supply AC lost

If we pull one of the power supply units (PSUs) out of the server entirely, the status looks like this:

PS1 Status       | C4h | ok  | 10.1 | Presence detected
PS2 Status | C5h | ok | 10.2 |

Besides the states shown above, we can run

ipmitool sensor get "PS1 Status"

to see the other possible values:

Sensor ID              : PS1 Status (0xc4)
Entity ID : 10.1 (Power Supply)
Sensor Type (Discrete): Power Supply (0x08)
Sensor Reading : 1h <------------- this is the 0x01 value mentioned in Method 2, i.e. the normal state
Event Message Control : Per-threshold
States Asserted : Power Supply
[Presence detected]
Assertion Events : Power Supply
[Presence detected]
Deassertion Events : Power Supply
[Failure detected]
Assertions Enabled : Power Supply
[Failure detected]
[Power Supply AC lost]
[AC lost or out-of-range]
[AC out-of-range, but present]
[Config Error]
Deassertions Enabled : Power Supply
[Failure detected]
[Power Supply AC lost]
[AC lost or out-of-range]
[AC out-of-range, but present]
[Config Error]
OEM : 0
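In all three `ipmitool sdr type "power supply"` samples shown for Method 1, the third column stays ok, so a check has to look at the asserted states in the last column instead. The sketch below does that; `psu_states` and `summarize` are hypothetical names, and the ok / fault / absent labels are my own shorthand for the three situations described above.

```python
from typing import List, Tuple


def psu_states(line: str) -> Tuple[str, List[str]]:
    """Split one row of `ipmitool sdr type "power supply"` into name and states."""
    fields = [f.strip() for f in line.split("|")]
    name = fields[0]
    raw = fields[4] if len(fields) > 4 else ""
    states = [s.strip() for s in raw.split(",") if s.strip()]
    return name, states


def summarize(states: List[str]) -> str:
    """Collapse the asserted states into ok, fault or absent."""
    if "Presence detected" not in states:
        return "absent"  # module pulled out of the chassis
    if len(states) > 1:
        return "fault"   # present, but something else is asserted too
    return "ok"


# The three situations shown in the samples above.
NORMAL = "PS1 Status       | C4h | ok  | 10.1 | Presence detected"
AC_LOST = "PS2 Status | C5h | ok | 10.2 | Presence detected, Failure detected, Power Supply AC lost"
REMOVED = "PS2 Status | C5h | ok | 10.2 |"

if __name__ == "__main__":
    for row in (NORMAL, AC_LOST, REMOVED):
        name, states = psu_states(row)
        print(name, summarize(states), states)
```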

Method 2

The following command also returns information about the power supply modules:

ipmitool sensor list

The relevant rows of the output are:

PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        
PS2 Status | 0x1 | discrete | 0x0100| na | na | na | na | na | na

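The value in the second column is the interesting part:

  • 0x01: status ok. This is the most common state (printed as 0x1 in the output above).
  • 0x00: power supply unit not present. This is typically what you see after a module has been pulled out of the server.
  • 0x03: power supply off or failed. I have not encountered this state myself; my guess is that it appears when the module itself has failed.
  • 0x0b: input out of range (e.g. no AC input). This is also very common, and is what you get when the power cord is unplugged.

I find this method better and personally recommend it.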
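The list above maps directly onto a lookup table. The sketch below decodes the second column of a PS Status row; the four codes are the ones listed in this article and may well be specific to this hardware, and `describe_psu_reading` is a hypothetical helper name.

```python
# Codes taken from the list above; other vendors may use different values.
PSU_READING_MEANINGS = {
    0x00: "power supply unit not present",
    0x01: "status ok",
    0x03: "power supply off or failed",
    0x0B: "input out of range (e.g. no AC input)",
}


def describe_psu_reading(raw: str) -> str:
    """Translate the second column of a PS Status row into a description."""
    text = raw.strip().lower()
    if text == "na":
        return "reading unavailable"
    code = int(text, 16)  # accepts "0x1", "0x01" and "0b" alike
    return PSU_READING_MEANINGS.get(code, f"unknown reading 0x{code:02x}")


if __name__ == "__main__":
    row = "PS1 Status | 0x1 | discrete | 0x0100| na | na | na | na | na | na"
    value = row.split("|")[1]
    print(describe_psu_reading(value))
```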

References

IPMItool Sometimes Fails to Detect Power Supply Failures on Rackable Nodes
