OS Kernel Parameters Every DBA Should Know


Every time I find the key to success, someone has gone and changed the lock.

## Background

To remain compatible with the widest possible range of hardware, operating systems ship with very conservative default settings.

Left untuned, these defaults may be a poor fit for HPC workloads, or even for moderately well-equipped hardware.

They can leave hardware performance on the table, and may even interfere with certain applications, databases in particular.

## OS Kernel Parameters Databases Care About

The examples below assume a host with 512 GB of RAM.

1\.

Parameter

```
fs.aio-max-nr  
```

Supported systems

```
CentOS 6, 7
```

Description

```
aio-nr & aio-max-nr:
.
aio-nr is the running total of the number of events specified on the
io_setup system call for all currently active aio contexts.
.
If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN.
.
Note that raising aio-max-nr does not result in the pre-allocation or re-sizing
of any kernel data structures.
.
aio-nr & aio-max-nr:
.
aio-nr shows the current system-wide number of asynchronous io requests.
.
aio-max-nr allows you to change the maximum value aio-nr can grow to.
```

Recommended setting

```
fs.aio-max-nr = 1xxxxxx
.
PostgreSQL and Greenplum never call io_setup to create aio contexts, so they do not require this setting.
For an Oracle database that is to use aio, it must be set.
Setting it does no harm either way: if you adopt asynchronous I/O later, you will not have to revisit it.
```
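
A quick way to see how close the system is to this ceiling is to compare the live counter against the limit (both are standard /proc files):

```
cat /proc/sys/fs/aio-nr       # events currently reserved via io_setup() across all active aio contexts
cat /proc/sys/fs/aio-max-nr   # the ceiling; once aio-nr reaches it, io_setup() fails with EAGAIN
```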

2\.

Parameter

```
fs.file-max
```

Supported systems

```
CentOS 6, 7
```

Description

```
file-max & file-nr:
.
The value in file-max denotes the maximum number of file handles that the Linux kernel will allocate.
.
When you get lots of error messages about running out of file handles,
you might want to increase this limit.
.
Historically, the kernel was able to allocate file handles dynamically,
but not to free them again.
.
The three values in file-nr denote :
the number of allocated file handles ,
the number of allocated but unused file handles ,
the maximum number of file handles.
.
Linux 2.6 always reports 0 as the number of free
file handles -- this is not an error, it just means that the
number of allocated file handles exactly matches the number of
used file handles.
.
Attempts to allocate more file descriptors than file-max are reported with printk,
look for "VFS: file-max limit reached".
```

Recommended setting

```
fs.file-max = 7xxxxxxx
.
PostgreSQL manages a virtual file descriptor layer of its own: the FDs it holds logically are mapped onto kernel file handles through an open/close mapping scheme, so in practice it needs far fewer kernel file handles than it appears to.
See the max_files_per_process parameter.
Assume 100 connections per GB of RAM and 1000 open files per connection: one PG instance then opens about 100,000 files; at 512 GB of RAM a machine could run 500 PG instances, i.e. 50 million file handles.
The setting above leaves generous headroom.
```
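
Actual consumption can be checked at any time through file-nr, which reports the three counters described above:

```
cat /proc/sys/fs/file-nr      # e.g. "1824 0 7xxxxxxx" = allocated, free (always 0 on 2.6+), maximum
```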

3\.

Parameter

```
kernel.core_pattern
```

Supported systems

```
CentOS 6, 7
```

Description

```
core_pattern:
.
core_pattern is used to specify a core dumpfile pattern name.
. max length 128 characters; default value is "core"
. core_pattern is used as a pattern template for the output filename;
certain string patterns (beginning with '%') are substituted with
their actual values.
. backward compatibility with core_uses_pid:
If core_pattern does not include "%p" (default does not)
and core_uses_pid is set, then .PID will be appended to
the filename.
. corename format specifiers:
% '%' is dropped
%% output one '%'
%p pid
%P global pid (init PID namespace)
%i tid
%I global tid (init PID namespace)
%u uid
%g gid
%d dump mode, matches PR_SET_DUMPABLE and
/proc/sys/fs/suid_dumpable
%s signal number
%t UNIX time of dump
%h hostname
%e executable filename (may be shortened)
%E executable path
% both are dropped
. If the first character of the pattern is a '|', the kernel will treat
the rest of the pattern as a command to run. The core dump will be
written to the standard input of that program instead of to a file.
```

Recommended setting

```
kernel.core_pattern = /xxx/core_%e_%u_%t_%s.%p
.
The target directory must have mode 777; if it is a symlink, the real directory it points to needs mode 777.
mkdir /xxx
chmod 777 /xxx
Leave enough free space for the dumps.
```
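
A minimal smoke test of the pattern, assuming a hypothetical dump directory /data/corefiles (substitute your real path for /xxx):

```
mkdir -p /data/corefiles && chmod 777 /data/corefiles
sysctl -w kernel.core_pattern=/data/corefiles/core_%e_%u_%t_%s.%p
ulimit -c unlimited                # allow core files in this shell (see the limits section below)
bash -c 'kill -SEGV $$'            # crash a throwaway child shell
ls /data/corefiles                 # a core_bash_<uid>_<time>_11.<pid> file should appear
```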

4\.

Parameter

```
kernel.sem
```

Supported systems

```
CentOS 6, 7
```

Description

```
kernel.sem = 4096 2147483647 2147483646 512000
.
4096 semaphores per set (must be >= 17; PostgreSQL uses one set of 17 semaphores per group of 16 backend processes),
2147483647 total semaphores system-wide (2^31-1, and greater than 4096*512000),
2147483646 maximum operations per semop() call (2^31-1),
512000 number of semaphore sets (assuming 100 connections per GB, 512 GB supports 51,200 connections; counting other processes, > 51200*2/16 leaves ample headroom)
.
# sysctl -w kernel.sem="4096 2147483647 2147483646 512000"
.
# ipcs -s -l
------ Semaphore Limits --------
max number of arrays = 512000
max semaphores per array = 4096
max semaphores system wide = 2147483647
max ops per semop call = 2147483646
semaphore max value = 32767
```

Recommended setting

```
kernel.sem = 4096 2147483647 2147483646 512000
.
4096 semaphores per set should fit most scenarios, so there is no harm in going large; the key point is that 512000 arrays is also plenty.
```
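
A back-of-the-envelope check of those numbers against the sizing assumption above (a rough sketch that ignores PostgreSQL's few auxiliary processes):

```
max_connections=51200                            # the 512 GB / 100-connections-per-GB assumption
sets=$(( (max_connections + 15) / 16 ))          # one set of 17 semaphores per 16 backends
echo "semaphore sets needed: $sets"              # 3200  -- far below SEMMNI=512000
echo "semaphores needed: $(( sets * 17 ))"       # 54400 -- far below SEMMNS
```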

5\.

Parameter

```
kernel.shmall = 107374182
kernel.shmmax = 274877906944
kernel.shmmni = 819200
```

Supported systems

```
CentOS 6, 7
```

Description

```
Assume a host with 512 GB of RAM.
.
shmmax maximum size of a single shared memory segment: 256 GB (half of RAM, in bytes)
shmall maximum combined size of all shared memory segments (80% of RAM, in PAGEs)
shmmni up to 819200 shared memory segments may be created (each database instance needs 2 segments at startup; once segments can be created dynamically, demand may be higher)
.
# getconf PAGE_SIZE
4096
```

Recommended setting

```
kernel.shmall = 107374182
kernel.shmmax = 274877906944
kernel.shmmni = 819200
.
In 9.2 and earlier, starting the database demanded a very large shared memory segment; budget along these lines:
Connections: (1800 + 270 * max_locks_per_transaction) * max_connections
Autovacuum workers: (1800 + 270 * max_locks_per_transaction) * autovacuum_max_workers
Prepared transactions: (770 + 270 * max_locks_per_transaction) * max_prepared_transactions
Shared disk buffers: (block_size + 208) * shared_buffers
WAL buffers: (wal_block_size + 8) * wal_buffers
Fixed space requirements: 770 kB
.
The recommendations above were derived for pre-9.2 releases, but they remain suitable for later versions.
```
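
The values above can be derived from the host's RAM; a sketch of the arithmetic (assumes 4 KB pages and the 512 GB host of this article):

```
mem_bytes=$(( 512 * 1024 * 1024 * 1024 ))                # or compute from MemTotal in /proc/meminfo
page=$(getconf PAGE_SIZE)
echo "kernel.shmmax = $(( mem_bytes / 2 ))"              # half of RAM, in bytes  -> 274877906944
echo "kernel.shmall = $(( mem_bytes * 8 / 10 / page ))"  # 80% of RAM, in pages   -> 107374182
```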

6\.

Parameter

```
net.core.netdev_max_backlog
```

Supported systems

```
CentOS 6, 7
```

Description

```
netdev_max_backlog
------------------
Maximum number of packets, queued on the INPUT side,
when the interface receives packets faster than kernel can process them.
```

Recommended setting

```
net.core.netdev_max_backlog=1xxxx
.
The longer the INPUT queue grows, the more it costs to process; if the host is managed with iptables, increase this value.
```

7\.

Parameter

```
net.core.rmem_default
net.core.rmem_max
net.core.wmem_default
net.core.wmem_max
```

Supported systems

```
CentOS 6, 7
```

Description

```
rmem_default
------------
The default setting of the socket receive buffer in bytes.
.
rmem_max
--------
The maximum receive socket buffer size in bytes.
.
wmem_default
------------
The default setting (in bytes) of the socket send buffer.
.
wmem_max
--------
The maximum send socket buffer size in bytes.
```

Recommended setting

```
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 4194304
```

8\.

Parameter

```
net.core.somaxconn
```

Supported systems

```
CentOS 6, 7
```

Description

```
somaxconn - INTEGER
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
Defaults to 128.
See also tcp_max_syn_backlog for additional tuning for TCP sockets.
```

Recommended setting

```
net.core.somaxconn=4xxx
```

9\.

Parameter

```
net.ipv4.tcp_max_syn_backlog
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_max_syn_backlog - INTEGER
Maximal number of remembered connection requests, which have not
received an acknowledgment from connecting client.
The minimal value is 128 for low memory machines, and it will
increase in proportion to the memory of machine.
If server suffers from overload, try increasing this number.
```

Recommended setting

```
net.ipv4.tcp_max_syn_backlog=4xxx
pgpool-II relies on this value to queue connections beyond num_init_children,
so it determines how many connections can wait in the queue.
```

10\.

Parameter

```
net.ipv4.tcp_keepalive_intvl=20
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_keepalive_time - INTEGER
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
.
tcp_keepalive_probes - INTEGER
How many keepalive probes TCP sends out, until it decides that the
connection is broken. Default value: 9.
.
tcp_keepalive_intvl - INTEGER
How frequently the probes are send out. Multiplied by
tcp_keepalive_probes it is time to kill not responding connection,
after probes started. Default value: 75sec i.e. connection
will be aborted after ~11 minutes of retries.
```

Recommended setting

```
net.ipv4.tcp_keepalive_intvl=20
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_time=60
.
After a connection has been idle for 60 seconds, send a keepalive probe every 20 seconds; if 3 probes in a row go unanswered, close the connection. From the start of idleness to the close takes 120 seconds in total.
```

11\.

Parameter

```
net.ipv4.tcp_mem=8388608 12582912 16777216
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_mem - vector of 3 INTEGERs: min, pressure, max
(unit: pages)
min: below this number of pages TCP is not bothered about its
memory appetite.
.
pressure: when amount of memory allocated by TCP exceeds this number
of pages, TCP moderates its memory consumption and enters memory
pressure mode, which is exited when memory consumption falls
under "min".
.
max: number of pages allowed for queueing by all TCP sockets.
.
Defaults are calculated at boot time from amount of available
memory.
On a host with 64 GB of RAM the automatically computed values look like this:
net.ipv4.tcp_mem = 1539615 2052821 3079230
.
On a host with 512 GB of RAM the automatically computed values look like this:
net.ipv4.tcp_mem = 49621632 66162176 99243264
.
Leaving this for the OS to compute at boot is also perfectly fine.
```

Recommended setting

```
net.ipv4.tcp_mem=8388608 12582912 16777216
.
Leaving this for the OS to compute at boot is also perfectly fine.
```

12\.

Parameter

```
net.ipv4.tcp_fin_timeout
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_fin_timeout - INTEGER
The length of time an orphaned (no longer referenced by any
application) connection will remain in the FIN_WAIT_2 state
before it is aborted at the local end. While a perfectly
valid "receive only" state for an un-orphaned connection, an
orphaned connection in FIN_WAIT_2 state could otherwise wait
forever for the remote to close its end of the connection.
Cf. tcp_max_orphans
Default: 60 seconds
```

Recommended setting

```
net.ipv4.tcp_fin_timeout=5
.
Speeds up the reclamation of dead (orphaned) connections.
```

13\.

Parameter

```
net.ipv4.tcp_synack_retries
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_synack_retries - INTEGER
Number of times SYNACKs for a passive TCP connection attempt will
be retransmitted. Should not be higher than 255. Default value
is 5, which corresponds to 31seconds till the last retransmission
with the current initial RTO of 1second. With this the final timeout
for a passive TCP connection will happen after 63seconds.
```

Recommended setting

```
net.ipv4.tcp_synack_retries=2
.
Shortens the TCP SYN-ACK retransmission timeout.
```

14\.

Parameter

```
net.ipv4.tcp_syncookies
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_syncookies - BOOLEAN
Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common 'SYN flood attack'
Default: 1
.
Note, that syncookies is fallback facility.
It MUST NOT be used to help highly loaded servers to stand
against legal connection rate. If you see SYN flood warnings
in your logs, but investigation shows that they occur
because of overload with legal connections, you should tune
another parameters until this warning disappear.
See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.
.
syncookies seriously violate TCP protocol, do not allow
to use TCP extensions, can result in serious degradation
of some services (f.e. SMTP relaying), visible not by you,
but your clients and relays, contacting you. While you see
SYN flood warnings in logs not being really flooded, your server
is seriously misconfigured.
.
If you want to test which effects syncookies have to your
network connections you can set this knob to 2 to enable
unconditionally generation of syncookies.
```

Recommended setting

```
net.ipv4.tcp_syncookies=1
.
Protects against SYN flood attacks.
```

15\.

Parameter

```
net.ipv4.tcp_timestamps
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_timestamps - BOOLEAN
Enable timestamps as defined in RFC1323.
```

Recommended setting

```
net.ipv4.tcp_timestamps=1
.
tcp_timestamps is a TCP protocol extension that uses timestamps to validate incoming segments for PAWS (Protect Against Wrapped Sequence numbers), and it can improve TCP performance.
```

16\.

Parameter

```
net.ipv4.tcp_tw_recycle
net.ipv4.tcp_tw_reuse
net.ipv4.tcp_max_tw_buckets
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_tw_recycle - BOOLEAN
Enable fast recycling TIME-WAIT sockets. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_tw_reuse - BOOLEAN
Allow to reuse TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint. Default value is 0.
It should not be changed without advice/request of technical
experts.
.
tcp_max_tw_buckets - INTEGER
Maximal number of timewait sockets held by system simultaneously.
If this number is exceeded time-wait socket is immediately destroyed
and warning is printed.
This limit exists only to prevent simple DoS attacks,
you _must_ not lower the limit artificially,
but rather increase it (probably, after increasing installed memory),
if network conditions require more than default value.
```

Recommended setting

```
net.ipv4.tcp_tw_recycle=0
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_tw_buckets = 2xxxxx
.
Do not enable net.ipv4.tcp_tw_recycle and net.ipv4.tcp_timestamps at the same time.
```

17\.

Parameter

```
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
```

Supported systems

```
CentOS 6, 7
```

Description

```
tcp_wmem - vector of 3 INTEGERs: min, default, max
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
Default: 1 page
.
default: initial size of send buffer used by TCP sockets. This
value overrides net.core.wmem_default used by other protocols.
It is usually lower than net.core.wmem_default.
Default: 16K
.
max: Maximal amount of memory allowed for automatically tuned
send buffers for TCP sockets. This value does not override
net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
automatic tuning of that socket's send buffer size, in which case
this value is ignored.
Default: between 64K and 4MB, depending on RAM size.
.
tcp_rmem - vector of 3 INTEGERs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
It is guaranteed to each TCP socket, even under moderate memory
pressure.
Default: 1 page
.
default: initial size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
Default: 87380 bytes. This value results in window of 65535 with
default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
less for default tcp_app_win. See below about these variables.
.
max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
Default: between 87380B and 6MB, depending on RAM size.
```

Recommended setting

```
net.ipv4.tcp_rmem=8192 87380 16777216
net.ipv4.tcp_wmem=8192 65536 16777216
.
A setting recommended for many databases; improves network performance.
```

18\.

Parameter

```
net.nf_conntrack_max
net.netfilter.nf_conntrack_max
```

Supported systems

```
CentOS 6
```

Description

```
nf_conntrack_max - INTEGER
Size of connection tracking table.
Default value is nf_conntrack_buckets value * 4.
```

Recommended setting

```
net.nf_conntrack_max=1xxxxxx
net.netfilter.nf_conntrack_max=1xxxxxx
```

19\.

Parameter

```
vm.dirty_background_bytes
vm.dirty_expire_centisecs
vm.dirty_ratio
vm.dirty_writeback_centisecs
```

Supported systems

```
CentOS 6, 7
```

Description

```
==============================================================
.
dirty_background_bytes
.
Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.
.
Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.
.
==============================================================
.
dirty_background_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
the background kernel flusher threads will start writing out dirty data.
.
==============================================================
.
dirty_bytes
.
Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.
.
Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.
.
Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.
.
==============================================================
.
dirty_expire_centisecs
.
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in 100'ths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.
.
==============================================================
.
dirty_ratio
.
Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
.
==============================================================
.
dirty_writeback_centisecs
.
The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.
.
Setting this to zero disables periodic writeback altogether.
.
==============================================================
```

Recommended setting

```
vm.dirty_background_bytes = 4096000000
vm.dirty_expire_centisecs = 6000
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 50
.
Reduces how often database processes must flush dirty pages themselves; size dirty_background_bytes according to the storage's real IOPS capability and the amount of RAM.
```
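
Whether the chosen thresholds actually keep writeback ahead of the workload can be watched live via the kernel's dirty-page counters:

```
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'   # dirty memory pending, and pages being written back
```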

20\.

Parameter

```
vm.extra_free_kbytes
```

Supported systems

```
CentOS 6
```

Description

```
extra_free_kbytes
.
This parameter tells the VM to keep extra free memory
between the threshold where background reclaim (kswapd) kicks in,
and the threshold where direct reclaim (by allocating processes) kicks in.
.
This is useful for workloads that require low latency memory allocations
and have a bounded burstiness in memory allocations,
for example a realtime application that receives and transmits network traffic
(causing in-kernel memory allocations) with a maximum total message burst
size of 200MB may need 200MB of extra free memory to avoid direct reclaim
related latencies.
.
The aim is to have background reclaim (kswapd) kick in this many kbytes earlier than direct reclaim by user processes would, so that user processes can keep allocating memory quickly.
```

Recommended setting

```
vm.extra_free_kbytes=4xxxxxx
```

21\.

Parameter

```
vm.min_free_kbytes
```

Supported systems

```
CentOS 6, 7
```

Description

```
min_free_kbytes:
.
This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.
.
Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.
.
Setting this too high will OOM your machine instantly.
```

Recommended setting

```
vm.min_free_kbytes = 2xxxxxx # rule of thumb: reserve 1 GB per 32 GB of RAM
.
Keeps the system responsive under heavy load and reduces the chance of memory-allocation deadlock.
```
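
A sketch of that rule of thumb, computed from the running host's RAM:

```
mem_kb=$(awk '/^MemTotal/{print $2}' /proc/meminfo)
echo "vm.min_free_kbytes = $(( mem_kb / 32 ))"   # 1 GB reserved per 32 GB of RAM
```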

22\.

Parameter

```
vm.mmap_min_addr
```

Supported systems

```
CentOS 6, 7
```

Description

```
mmap_min_addr
.
This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.
```

Recommended setting

```
vm.mmap_min_addr=6xxxx
.
Provides defense against latent kernel NULL-dereference bugs.
```

23\.

Parameter

```
vm.overcommit_memory
vm.overcommit_ratio
```

Supported systems

```
CentOS 6, 7
```

Description

```
==============================================================
.
overcommit_kbytes:
.
When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.
.
Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).
.
==============================================================
.
overcommit_memory:
.
This value contains a flag that enables memory overcommitment.
.
When this flag is 0,
the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.
.
When this flag is 1,
the kernel pretends there is always enough memory until it actually runs out.
.
When this flag is 2,
the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.
.
This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.
.
The default value is 0.
.
See Documentation/vm/overcommit-accounting and
security/commoncap.c::cap_vm_enough_memory() for more information.
.
==============================================================
.
overcommit_ratio:
.
When overcommit_memory is set to 2,
the committed address space is not permitted to exceed
swap + this percentage of physical RAM.
See above.
.
==============================================================
```

Recommended setting

```
vm.overcommit_memory = 0
vm.overcommit_ratio = 90
.
When vm.overcommit_memory = 0, vm.overcommit_ratio is not consulted and may be left unset.
```
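
Current commit accounting is visible in /proc/meminfo; CommitLimit is only enforced in mode 2, but watching Committed_AS is still a useful gauge of overcommit pressure:

```
grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo
cat /proc/sys/vm/overcommit_memory
```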

24\.

Parameter

```
vm.swappiness
```

Supported systems

```
CentOS 6, 7
```

Description

```
swappiness
.
This control is used to define how aggressive the kernel will swap
memory pages.
Higher values will increase agressiveness, lower values
decrease the amount of swap.
.
The default value is 60.
```

Recommended setting

```
vm.swappiness = 0
```

25\.

Parameter

```
vm.zone_reclaim_mode
```

Supported systems

```
CentOS 6, 7
```

Description

```
zone_reclaim_mode:
.
Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.
.
This is value ORed together of
.
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
.
zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.
.
zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.
.
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.
.
Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
```

Recommended setting

```
vm.zone_reclaim_mode=0
.
Effectively: do not use NUMA-local reclaim.
```

26\.

Parameter

```
net.ipv4.ip_local_port_range
```

Supported systems

```
CentOS 6, 7
```

Description

```
ip_local_port_range - 2 INTEGERS
Defines the local port range that is used by TCP and UDP to
choose the local port. The first number is the first, the
second the last local port number. The default values are
32768 and 61000 respectively.
.
ip_local_reserved_ports - list of comma separated ranges
Specify the ports which are reserved for known third-party
applications. These ports will not be used by automatic port
assignments (e.g. when calling connect() or bind() with port
number 0). Explicit port allocation behavior is unchanged.
.
The format used for both input and output is a comma separated
list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
10). Writing to the file will clear all previously reserved
ports and update the current list with the one given in the
input.
.
Note that ip_local_port_range and ip_local_reserved_ports
settings are independent and both are considered by the kernel
when determining which ports are available for automatic port
assignments.
.
You can reserve ports which are not in the current
ip_local_port_range, e.g.:
.
$ cat /proc/sys/net/ipv4/ip_local_port_range
32000 61000
$ cat /proc/sys/net/ipv4/ip_local_reserved_ports
8080,9148
.
although this is redundant. However such a setting is useful
if later the port range is changed to a value that will
include the reserved ports.
.
Default: Empty
```

Recommended setting

```
net.ipv4.ip_local_port_range=40000 65535
.
Restricts the local dynamic-port range so that ephemeral ports cannot occupy listening ports.
```

27\.

Parameter

```
vm.nr_hugepages
```

Supported systems

```
CentOS 6, 7
```

Description

```
==============================================================
nr_hugepages
Change the minimum size of the hugepage pool.
See Documentation/vm/hugetlbpage.txt
==============================================================
nr_overcommit_hugepages
Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.
See Documentation/vm/hugetlbpage.txt
.
The output of "cat /proc/meminfo" will include lines like:
......
HugePages_Total: vvv
HugePages_Free: www
HugePages_Rsvd: xxx
HugePages_Surp: yyy
Hugepagesize: zzz kB
.
where:
HugePages_Total is the size of the pool of huge pages.
HugePages_Free is the number of huge pages in the pool that are not yet
allocated.
HugePages_Rsvd is short for "reserved," and is the number of huge pages for
which a commitment to allocate from the pool has been made,
but no allocation has yet been made. Reserved huge pages
guarantee that an application will be able to allocate a
huge page from the pool of huge pages at fault time.
HugePages_Surp is short for "surplus," and is the number of huge pages in
the pool above the value in /proc/sys/vm/nr_hugepages. The
maximum number of surplus huge pages is controlled by
/proc/sys/vm/nr_overcommit_hugepages.
.
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
.
/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of 'nr_hugepages'.
```

Recommended setting

```
Set this if you plan to use PostgreSQL's huge page support.
It only needs to exceed the shared memory the database requires.
```
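
A rough sizing sketch: divide the database's shared memory requirement by the huge page size and add some slack (the 128 GB shared-memory figure here is purely hypothetical):

```
hp_kb=$(awk '/^Hugepagesize/{print $2}' /proc/meminfo)    # typically 2048 kB
shared_kb=$(( 128 * 1024 * 1024 ))                        # hypothetical 128 GB of shared memory
echo "vm.nr_hugepages = $(( shared_kb / hp_kb + 1024 ))"  # plus some headroom
```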

28\.

Parameter

```
fs.nr_open
```

Supported systems

```
CentOS 6, 7
```

Description

```
nr_open:
.
This denotes the maximum number of file-handles a process can
allocate. Default value is 1024*1024 (1048576) which should be
enough for most machines. Actual limit depends on RLIMIT_NOFILE
resource limit.
.
It also caps the file-handle limits in security/limits.conf: a single process cannot open more handles than fs.nr_open, so to raise the per-process file-handle limit you must first raise nr_open.
```

Recommended setting

```
For a PostgreSQL database holding a very large number of objects (tables, views, indexes, sequences, materialized views, etc.), set it to around 20 million,
e.g. fs.nr_open=20480000
```


## Resource Limits Databases Care About
1\. Set via /etc/security/limits.conf, or with ulimit (a sample snippet follows the list below).

2\. Inspect a running process's effective limits via /proc/$pid/limits.

- core - limits the core file size (KB)

- memlock - max locked-in-memory address space (KB)

- nofile - max number of open files. Recommended: 10 million; but sysctl fs.nr_open must be set larger first, otherwise logins to the system will fail.

- nproc - max number of processes

The four items above are the ones to watch most closely.
….

- data - max data size (KB)

- fsize - maximum filesize (KB)

- rss - max resident set size (KB)

- stack - max stack size (KB)

- cpu - max CPU time (MIN)

- as - address space limit (KB)

- maxlogins - max number of logins for this user

- maxsyslogins - max number of logins on the system

- priority - the priority to run user process with

- locks - max number of file locks the user can hold

- sigpending - max number of pending signals

- msgqueue - max memory used by POSIX message queues (bytes)

- nice - max nice priority allowed to raise to values: [-20, 19]

- rtprio - max realtime priority
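
A minimal /etc/security/limits.conf sketch covering the four key items; the "postgres" user name and the numbers are placeholders to adapt:

```
# hypothetical values for a dedicated database OS user
postgres  soft  core     unlimited
postgres  hard  core     unlimited
postgres  soft  memlock  unlimited
postgres  hard  memlock  unlimited
postgres  soft  nofile   10240000   # must stay below sysctl fs.nr_open
postgres  hard  nofile   10240000
postgres  soft  nproc    65536
postgres  hard  nproc    65536
```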


## I/O Scheduling Policies Databases Care About
1\. Current operating systems support several I/O scheduling policies, including cfq, deadline, and noop.

```
/kernel-doc-xxx/Documentation/block
-r--r--r-- 1 root root   674 Apr  8 16:33 00-INDEX
-r--r--r-- 1 root root 55006 Apr  8 16:33 biodoc.txt
-r--r--r-- 1 root root   618 Apr  8 16:33 capability.txt
-r--r--r-- 1 root root 12791 Apr  8 16:33 cfq-iosched.txt
-r--r--r-- 1 root root 13815 Apr  8 16:33 data-integrity.txt
-r--r--r-- 1 root root  2841 Apr  8 16:33 deadline-iosched.txt
-r--r--r-- 1 root root  4713 Apr  8 16:33 ioprio.txt
-r--r--r-- 1 root root  2535 Apr  8 16:33 null_blk.txt
-r--r--r-- 1 root root  4896 Apr  8 16:33 queue-sysfs.txt
-r--r--r-- 1 root root  2075 Apr  8 16:33 request.txt
-r--r--r-- 1 root root  3272 Apr  8 16:33 stat.txt
-r--r--r-- 1 root root  1414 Apr  8 16:33 switching-sched.txt
-r--r--r-- 1 root root  3916 Apr  8 16:33 writeback_cache_control.txt
```


For the detailed rules of each scheduling policy, see the wiki or the kernel documentation.

The active policy for a device can be read from sysfs (the bracketed entry is the one in effect):

```
cat /sys/block/vdb/queue/scheduler
noop [deadline] cfq
```


To change it:

```
echo deadline > /sys/block/hda/queue/scheduler
```


Or set it via the kernel boot parameters:

```
grub.conf
elevator=deadline
```

Across many benchmark results, databases deliver more stable performance under the deadline scheduler.

## Miscellaneous

1. Disable transparent huge pages (a runtime sketch follows this list).

2. Disable NUMA.

3. Align SSD partitions.
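
For the first item, a common runtime approach on CentOS 7 looks like this (a sketch; on CentOS 6 the path is /sys/kernel/mm/redhat_transparent_hugepage, and making it persistent requires rc.local or a kernel boot parameter):

```
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```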
