Performance Monitoring of Red Hat Satellite 6 using satperf

Continuous time-series metric collection for Satellite and all Capsules is essential when running Satellite at scale.
This post describes how to configure and monitor these metrics using satellite-performance (satperf).
1) Tools:
  • Collectd – daemon to collect system performance statistics
    • Collects CPU, memory, disk, network, and per-process stats (matched by regex), plus PostgreSQL, MongoDB, turbostat, qpid, Foreman, Dynflow, Passenger, Puppet, Tomcat, and collectd itself (see the minimal config sketch after this list)
  • Graphite/Carbon
    • Carbon receives metrics and flushes them to Whisper database files
    • Graphite is the web-app frontend to Carbon
  • Grafana – visualizes metrics from multiple backends
    • Dashboards are saved as JSON and customized by Ansible during deployment
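
A minimal collectd configuration sketch showing how such metrics can be shipped to Graphite. The file name, Graphite host, metric prefix, and the process-match example are illustrative assumptions, not satperf's actual generated configuration:

# /etc/collectd.d/satellite.conf (illustrative)
LoadPlugin cpu
LoadPlugin memory
LoadPlugin disk
LoadPlugin processes
LoadPlugin write_graphite

<Plugin processes>
  # per-process stats by regex, e.g. the Dynflow executor
  ProcessMatch "dynflow" "dynflow_executor"
</Plugin>

<Plugin write_graphite>
  <Node "graphite">
    # assumed Graphite/Carbon host and metric prefix
    Host "graphite.example.com"
    Port "2003"
    Protocol "tcp"
    Prefix "satellite."
  </Node>
</Plugin>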

2) Architecture

(diagram: monitoring architecture)

3) How do I configure performance monitoring?

Archit has written a nice blog post on the configuration.

Description of metrics collected in satperf:
http://arcolife.github.io/blog/2016/10/05/monitoring-in-satperf-metrics-collection

4) Example Graphs
4.1) Passenger memory
4.2) PostgreSQL DB (Candlepin & Foreman)
4.3) Candlepin DB
4.4) Puppet registrations
4.5) Dynflow memory
Thanks to Archit and Jhutar for providing inputs and help!

Red Hat Satellite 6.2 Considerations for Large-Scale Deployments

Red Hat Satellite is a complete systems management product that allows system administrators to manage the full life cycle of Red Hat deployments across physical, virtual, and private clouds. Red Hat Satellite delivers system provisioning, configuration management, software management, and subscription management, all while maintaining high scalability and security. Satellite 6.2 is the third major release of the next-generation Satellite, with a raft of improvements that continue to narrow the gaps in functionality found in Satellite 5 in many critical areas of the product. This blog provides basic guidelines and considerations for tuning Red Hat Satellite 6.2 and Capsule servers for large-scale deployments.

1) Increase open-files-limit for Apache with systemd on the Satellite and Capsule servers

# cat /etc/systemd/system/httpd.service.d/limits.conf
[Service]
LimitNOFILE=1000000

# systemctl daemon-reload
# katello-service restart
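
An optional quick check that the new limit took effect after the restart (the same commands, with the service name swapped, work for the qpidd and qdrouterd drop-ins below):

# systemctl show httpd.service -p LimitNOFILE
# grep 'Max open files' /proc/$(pgrep -o httpd)/limits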

2) Increase open-files-limit for Qpid with systemd on the Satellite and Capsule servers

# cat /etc/systemd/system/qpidd.service.d/limits.conf
[Service]
LimitNOFILE=1000000

# systemctl daemon-reload
# katello-service restart

3) Increase PostgreSQL shared_buffers

When registering content hosts at scale to the Satellite server, shared_buffers needs to be set appropriately in postgresql.conf. Recommended: 256 MB.
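
A sketch of the change; the data directory path below is the default for the bundled PostgreSQL on RHEL 7 and may differ on your system:

# grep shared_buffers /var/lib/pgsql/data/postgresql.conf
shared_buffers = 256MB

# systemctl restart postgresql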

4) Increase PostgreSQL max_connections

When registering content hosts at scale, it is recommended to increase the max_connections setting (100 by default) according to your needs and hardware profile. For example, you might need to set it to 200 when registering 200 content hosts in parallel.
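
For example, to allow 200 parallel registrations (same assumption about the postgresql.conf location as above):

# grep max_connections /var/lib/pgsql/data/postgresql.conf
max_connections = 200

# systemctl restart postgresql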

5) Storage planning for Qpid

When you use katello-agent extensively, plan storage capacity for /var/lib/qpidd in advance. Currently, in Satellite 6.2, /var/lib/qpidd requires about 2 MB of disk space per content host; for example, 10,000 content hosts need roughly 20 GB.

6) Increase open-files-limit for Qpid Dispatch Router with systemd on the Satellite and Capsule servers

# cat /etc/systemd/system/qdrouterd.service.d/limits.conf
[Service]
LimitNOFILE=1000000

# systemctl daemon-reload
# katello-service restart

Special thanks to Jan Jutar and Archit Sarma for their help in getting the scale numbers.

External Snapshot of raw images

When an external snapshot of a raw image is taken, the delta (all subsequent writes) is written to a qcow2 overlay file, while the original image becomes the read-only backing file.

virsh # list
Id    Name                           State
----------------------------------------------------
4     cbtool                         running
6     master                         running

virsh # snapshot-create-as master snap1-master "snap1" --diskspec vda,file=/home/snap1.qcow2 --disk-only --atomic

Domain snapshot snap1-master created

Snapshot tree:

virsh # snapshot-list master --tree
snap1-master

virsh # snapshot-create-as master snap2-master "snap2" --diskspec vda,file=/home/snap2.qcow2 --disk-only --atomic
Domain snapshot snap2-master created
virsh # snapshot-list master --tree
snap1-master
|
+- snap2-master

Image info:

qemu-img info  /home/snap2.qcow2
image: /home/snap2.qcow2
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 196K
cluster_size: 65536
backing file: /home/snap1.qcow2
backing file format: qcow2
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
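
To print the whole backing chain (active overlay down to the raw base) in one command, reasonably recent qemu-img versions accept a --backing-chain flag:

qemu-img info --backing-chain /home/snap2.qcow2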

How to Delete:

virsh # snapshot-list master
Name                 Creation Time             State
------------------------------------------------------------
snap2-master         2016-01-07 03:38:10 -0500 disk-snapshot

virsh # snapshot-delete master snap2-master --metadata
Domain snapshot snap2-master deleted

Note: with --metadata, only libvirt's record of the snapshot is removed; the qcow2 overlay file (/home/snap2.qcow2 in this example) remains on disk.

Starting MongoDB on CentOS with NUMA disabled

I have been noticing this every time I run MongoDB on CentOS/RHEL or any other NUMA machine.

Error:
Sun Dec 20 06:26:16.832 [initandlisten] ** WARNING: You are running on a NUMA machine.
Sun Dec 20 06:26:16.832 [initandlisten] ** We suggest launching mongod like this to avoid performance problems:
Sun Dec 20 06:26:16.832 [initandlisten] ** numactl --interleave=all mongod [other options]
Sun Dec 20 06:26:16.832 [initandlisten]

To avoid performance issues, it is recommended to run MongoDB with memory interleaved across all NUMA nodes.

I did not find a cleaner way to fix this through the packaged service, so kill the existing mongod and restart it as below.

numactl --interleave=all runuser -s /bin/bash mongodb -c "/usr/bin/mongod --dbpath /var/lib/mongodb"
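
To make this survive restarts, a systemd drop-in can wrap mongod with numactl. This is only a sketch; the unit name, binary path, and config path are assumptions that depend on how MongoDB was installed (older sysv-init installs would need the init script edited instead):

# cat /etc/systemd/system/mongod.service.d/numa.conf
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf

# systemctl daemon-reload
# systemctl restart mongod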

Number of I/O requests per io_submit

Asynchronous I/O in QEMU (aio=native) submits requests with the io_submit syscall. To find out how many I/O requests are batched into each io_submit call from a KVM VM, trace the io_submit perf events while the benchmark runs; the example here is a sequential 4k write run on an SSD. The sys_enter_io_submit and sys_exit_io_submit events are the ones that matter; each io_submit call is also followed by *_io_getevents calls, which are irrelevant to the present topic. While the benchmark is measuring IOPS, trace the events on the host, e.g. by attaching perf record to the qemu-kvm process (the PID placeholder is yours to fill in):

# perf record -g -p <qemu-kvm PID> -e syscalls:sys_enter_io_submit -e syscalls:sys_exit_io_submit -e syscalls:sys_enter_io_getevents -e syscalls:sys_exit_io_getevents

Get the IOPS for this run:

write-4KiB IOPS: 76971.9

Get the number of io_submit calls from the captured perf.data. Counting the enter_io_submit events is enough; of course there will be an equal number of exit events.

[root@perf io-submit-write-4k]# perf script | grep io_submit | grep enter | wc -l
493370
[root@perf io-submit-write-4k]#
Get the timestamps of the first and last io_submit events.
First: 
qemu-kvm  3693 [025]  1914.589390: syscalls:sys_enter_io_submit: ctx_id: 0x7f3f18a61000, nr: 0x000000d1, iocbpp:
                     697 io_submit (/usr/lib64/libaio.so.1.0.1)
                       8 [unknown] ([unknown])
                       0 [unknown] ([unknown])

Last: 

qemu-kvm  3693 [001]  1949.737723: syscalls:sys_enter_io_submit: ctx_id: 0x7f3f18a61000, nr: 0x000000d1, iocbpp: 0x7ffd4e50b7b0
                 697 io_submit (/usr/lib64/libaio.so.1.0.1)
        7f3f1c6c9250 [unknown] ([unknown])
                   0 [unknown] ([unknown])

Number of submits per second:
Timestamp diff: 1949.737723 - 1914.589390 = 35.15 s
Number of submits: 493370

Submits/sec = 493370 / 35.15 = 14036.13

The IOPS metric is requests per second, and the above gives submits per second, so the number of requests per submit is:
requests/submit = (requests/sec) / (submits/sec)
                = 76971.9 / 14036.13
                = 5.48

So for my 4k write run, there are about 5.48 requests per io_submit call.
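
For quick verification, the same arithmetic through bc, using the numbers measured above:

# echo "scale=2; 493370 / 35.15" | bc          # ~14036 io_submit calls per second
# echo "scale=2; 76971.9 / 14036.13" | bc      # ~5.48 requests per io_submit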

iostat analysis: time spent on each I/O request

These are results from one of the 4K write runs to disk vdb (an LVM volume on an SSD, which is irrelevant to the present discussion).

await – "The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them." (from the iostat man page)

Wait_Time-vdb-write=0.042727

Throughput:

Throughput-vdb-write=65.902727

svctm – "The average service time (in milliseconds) for I/O requests that were issued to the device."

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
vdb               0.00     0.00 49409.00    0.00   193.00     0.00     8.00     7.04    0.14    0.14    0.00   0.02 100.00

Disk Utilization

Utilization-vdb=74.195455

Frame size: 4K

0.0427 ms = 42.7 microseconds, i.e. an average of 42.7 microseconds per request (await).

Throughput is 65 MB/s, so with 4 KiB requests: 65 * 1024 / 4

= 16640 requests/s

1000000 microseconds/s / 16640 requests/s ≈ 60 microseconds per request

60 microseconds/request * ~0.75 disk utilization (the device was busy only about 75% of that time) ≈ 45 microseconds/request

So the time spent on each I/O request is about 45 microseconds.
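
The same calculation as a quick shell sketch, with the measured numbers from above:

# echo "scale=2; 65 * 1024 / 4" | bc           # 16640 requests/s at 65 MB/s with 4 KiB requests
# echo "scale=2; 1000000 / 16640" | bc         # ~60 microseconds of wall-clock time per request
# echo "scale=2; 60.09 * 0.75" | bc            # ~45 microseconds of device time per request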

Thanks to Stefan