Thursday, September 1, 2016

Ceph Monitors Deadlock

Introduction
Part of my role is general maintenance of the QE department's Ceph cluster. Usually it doesn't consume a lot of my time beyond general monitoring, providing keyrings and pools, and so on, you know, general management.
‘My’ cluster has 3 servers with 9 disks each:
  • 2 for the OS, RHEL 7 (RAID 1)
  • 1 SSD for journaling
  • 6 disks as OSDs
And it works pretty well.

The Problem

A colleague asked me first thing in the morning about the relationship between Ceph and Openstack. As a huge believer in teaching by example, I logged in to one of the servers and ran the rbd command to show the list of images in the pool.
$ sudo rbd -p <pool name> --id <client> ls
The client failed to connect to the monitors, all three of them:
2016-09-01 14:00:04.946448 7f2a2d2cc700  0 -- <IP address>:6789/0 >> <IP address>:6789/0 pipe(0x4cee000 sd=13 :0 s=1 pgs=0 cs=0 l=0 c=0x4967080).fault

Troubleshooting

What I could go with

First of all, when the RBD client fails to connect, it probably means that the ceph CLI will not work either, so there is no reason, IMO, to check the cluster health with
$ sudo ceph health
because the reply will be the same.
The first thing on my mind was to check the monitor daemon status on all the servers in the cluster
$ sudo service ceph status mon
The result was
=== mon.ceph1 ===
mon.ceph1: not running.
OK, then the daemon is down, let us bring it back up
$ sudo service ceph start mon
No joy - the daemon stayed down.
After that, I went to the Ceph monitor log, /var/log/ceph/ceph-mon-ceph.1.log. It showed me the following log entries; two messages stood out to me, starting with:
2016-09-01 09:23:49.490950 7efd5a7137c0 -1 WARNING: 'mon addr' config option 10.35.65.98:0/0 does not match monmap file continuing with monmap configuration
With this line as the punchline:
2016-09-01 09:23:49.762012 7efd5021c700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed:-8190
So the problem is either with the monitors' keyring, meaning authentication failed, or there is a problem with the monitor map (monmap) configuration.
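Before changing anything, it is worth confirming what the monitor actually has on disk. A quick check (a sketch, assuming the default monitor data path and that the daemon is stopped, which it already is here) is to extract the monmap from one of the monitors, print it and compare the listed addresses with /etc/ceph/ceph.conf:
$ sudo ceph-mon -i ceph1 --extract-monmap /tmp/monmap
$ sudo monmaptool --print /tmp/monmap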

Dead ends (but should be checked)

  • The keyrings of the monitors were identical, so no authentication problem (though it might still be a permission issue, with the daemon failing to read the file)
  • The NTP service is up and running and all the clocks are in sync

The Solution

Fixing this issue required the monmaptool command.
Though Sébastien Han recommends not doing this on a live cluster, I did it anyhow, accepting the minor risk of data loss in a staging environment.
I got the cluster FSID from /etc/ceph/ceph.conf and created a new monmap with monmaptool
$ sudo monmaptool --create --add ceph1 <IP address>:6789 --add ceph2 <IP address>:6789 --add ceph3 <IP address>:6789 --fsid <Ceph’s cluster FSID> --clobber monmap
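Before copying the file around, it doesn't hurt to verify that the generated map lists all three monitors with the right addresses:
$ sudo monmaptool --print monmap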
Once the file was ready, I copied it to all the servers in the cluster and stopped all the Ceph daemons
$ sudo service ceph stop


Now that the cluster is down and out, I can inject the newly created map into the monitors
$ sudo ceph-mon -i ceph<X> --inject-monmap monmap
Timidly, I started the monitor daemons together (as much as I could) and behold!
=== mon.ceph1 ===
mon.ceph1: running {"version":"0.94.5-9.el7cp"}
Afterwards I started the rest of the Ceph daemons
$ sudo service ceph start
And the cluster status is HEALTH_OK

Thursday, March 31, 2016

Connect VMs running on KVM to VLAN with OpenVswitch


Background

In the course of my work I had to PXE boot virtual machines running on 2 different physical servers. The limitations of the lab I am working in would not allow me to run DHCP and PXE on the lab's network, but I did have a VLAN trunk connected to the physical servers at my disposal. So I searched for a solution that would connect all the VMs to the same network, so I could bring up a DHCP and PXE server and deploy the OS on them with it.

Why I Chose OpenVswitch

Due to my familiarity with Openstack I knew that OVS has the abilities I needed. I presumed it would be fairly easy to make it work; I was mistaken. Documentation about OVS and virtual machines is available, but there was nothing that suited my exact needs, which were pretty basic.

OpenVswitch Basic Commands

Presuming that OVS is installed, there are some basic commands one should be familiar with.
ovs-vsctl show - display the detail of bridges and ports
ovs-vsctl add-br <bridge name> - create bridge
ovs-vsctl del-br <bridge name> - delete bridge
ovs-vsctl add-port <bridge> <device> - add port to bridge
ovs-vsctl del-port <bridge> <device> - delete port in bridge
I know these commands seem trivial, but new users will find them helpful.

Disclaimers and Assumptions

I work on RHEL 7 and I use Libvirt. If you're using another Linux distribution, please make the necessary adjustments.


I don't like NetworkManager, I'm not used to it and it only interferes with what I did, so please, please, please stop and disable the NetworkManager service.
systemctl stop NetworkManager
systemctl disable NetworkManager

I do assume that Libvirt and OpenVswitch are installed and running on the machine.
yum install libvirt openvswitch -y
systemctl enable libvirtd
systemctl enable openvswitch
systemctl start libvirtd
systemctl start openvswitch

I'll name all the OVS bridges with the tag of the VLAN; for example, a VLAN with the tag 101 will be named br-101.

Preparing the Physical Machines

As I said, I had 2 physical servers, each with a dual port NIC: one port connected to the lab network, the other port to the VLAN trunk. The trunk's range in this example is 1-10; I'll use just two VLANs here, to simplify matters.

Creating Devices for the VLANs

When we start, the network configuration of the physical machines is as follows:
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
   inet 127.0.0.1/8 scope host lo
      valid_lft forever preferred_lft forever
   inet6 ::1/128 scope host
      valid_lft forever preferred_lft forever

2: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
   link/ether 00:8c:fa:02:91:0a brd ff:ff:ff:ff:ff:ff
   inet 192.168.0.124/24 brd 10.35.160.255 scope global dynamic enp2s0f0
      valid_lft 35723sec preferred_lft 35723sec
   inet6 2620:52:0:23a0:28c:faff:fe02:910a/64 scope global mngtmpaddr dynamic
      valid_lft 2591974sec preferred_lft 604774sec
   inet6 fe80::28c:faff:fe02:910a/64 scope link
      valid_lft forever preferred_lft forever
3: enp2s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
   link/ether 00:8c:fa:02:91:0b brd ff:ff:ff:ff:ff:ff
   inet6 fe80::28c:faff:fe02:910b/64 scope link
      valid_lft forever preferred_lft forever


The trunk is connected to NIC enp2s0f1. This NIC's network script should be configured as follows:
DEVICE=enp2s0f1
BOOTPROTO=none
ONBOOT=yes
TYPE=Ethernet


For each VLAN create a network script in /etc/sysconfig/network-scripts/ifcfg-<NIC>.<vlan tag>; for example, with VLAN tag 1 the file is ifcfg-enp2s0f1.1:
DEVICE=enp2s0f1.1
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
VLAN=yes
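The script for the second VLAN in the trunk follows the same pattern (assuming tag 2, which matches the enp2s0f1.2 interface shown below), in /etc/sysconfig/network-scripts/ifcfg-enp2s0f1.2:
DEVICE=enp2s0f1.2
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
VLAN=yes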


Restart the network service
systemctl restart network
The end result is
ip a ...
3: enp2s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
   link/ether 00:8c:fa:02:91:0b brd ff:ff:ff:ff:ff:ff
   inet6 fe80::28c:faff:fe02:910b/64 scope link
      valid_lft forever preferred_lft forever
4: enp2s0f1.1@enp2s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP
   link/ether 00:8c:fa:02:91:0b brd ff:ff:ff:ff:ff:ff
   inet6 fe80::28c:faff:fe02:910b/64 scope link
      valid_lft forever preferred_lft forever
5: enp2s0f1.2@enp2s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP
   link/ether 00:8c:fa:02:91:0b brd ff:ff:ff:ff:ff:ff
   inet6 fe80::28c:faff:fe02:910b/64 scope link
      valid_lft forever preferred_lft forever

Create OpenVswitch Bridge

Create a bridge for VLAN 1
ovs-vsctl add-br br-1
Create a port that connects the bridge to the VLAN
ovs-vsctl add-port br-1 enp2s0f1.1
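The second VLAN gets the same treatment (again assuming tag 2):
ovs-vsctl add-br br-2
ovs-vsctl add-port br-2 enp2s0f1.2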
After the creation of the bridges and the ports, additional interfaces will be visible.

Create Network in Libvirt

Creating a network in Libvirt requires the following XML file to define the network.

<network>
 <name>br-1</name>
 <forward mode='bridge'/>
 <bridge name='br-1'/>
 <virtualport type='openvswitch'/>
</network>
Save this XML file with a simple name, for example br-1.xml

Define the network in Libvirt
virsh net-define br-1.xml
Start the network
virsh net-start br-1
Set the network to autostart (otherwise, the next time the Libvirt service is restarted the network will not be available)
virsh net-autostart br-1
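As a quick sanity check, the network should now show up as active and marked for autostart:
virsh net-list --all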

Now Libvirt has an available network connected to the VLAN, and the virtual machines no longer need NAT.
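To actually put a VM on that network, the guest needs an interface attached to it. One way to do that for an existing domain (a sketch; <vm name> is a placeholder for whatever guest you already have defined):
virsh attach-interface --domain <vm name> --type network --source br-1 --model virtio --config --live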

Tuesday, November 17, 2015

RabbitMQ Best Practices

Hi,
At the last Openstack summit in Tokyo one lecture caught my eye, a lecture by Michael Klishin from the RabbitMQ team. He had some tips about how to configure Linux OS parameters in order to deal with the load of these modern times. I took the liberty of summarizing his lecture and adding some comments where I didn't know what he was talking about.

Personally, I've only started testing systems with these changes, so I don't know what the implications are in the long run. So please, be cautious; don't do anything just because someone on the internet told you to do it.

Adapt Your OS Resources

Inadequate OS Settings

In the world we live in today, each service that operators run and manage has different requirements, yet a lot of them are run on an OS with the default parameters. Unfortunately, the Linux defaults are not suitable for these tasks; they are tuned as if we were still in the 1990s.
Here are some points about how you, as the operator, can change OS parameters to improve performance.

Open Files Limits

The default number of open files in Linux is 1024. To check the current value:
# ulimit -n
1024
The recommended number of open files for a Linux OS running RabbitMQ is 500K.

To change the limit of open files temporarily (until the next reboot)
# ulimit -n 512000

Changing the parameter permanently requires adding 2 lines to /etc/security/limits.conf
*               soft    nofile          512000
*               hard    nofile         512000
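A quick way to confirm that RabbitMQ actually sees the new limit is rabbitmqctl status, which reports the file descriptor limit and usage in its file_descriptors section (note that if the broker is started by systemd, limits.conf alone may not apply to it and the unit may need its own LimitNOFILE setting):
# rabbitmqctl status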

The default value of file-max is set to about 10% of the system's total RAM. The recommendation for a system running RabbitMQ is 500K.

Changing the number of file handles requires the following:
Set the value in /proc/sys/fs/file-max
# echo 512000 >> /proc/sys/fs/file-max
This is not a permanent change, though.
Set the parameter fs.file-max = 512000 in /etc/sysctl.conf

Reload the sysctl settings for the changes to take effect
# sysctl -p

TCP Keep Alive 

Linux has built-in support for keeping TCP connections alive. The keepalive behaviour is controlled by three parameters:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9

net.ipv4.tcp_keepalive_time sets the time in seconds a connection will stay idle before a keepalive probe is sent. net.ipv4.tcp_keepalive_intvl is the interval, in seconds, between probes. net.ipv4.tcp_keepalive_probes is the number of probes that will be sent before the connection is dropped. Keeping an idle connection around for 2 hours (7200 seconds) is not suitable for the demands on an OS running RabbitMQ; the recommendation is to decrease the time and interval in order to close dead connections as quickly as possible.

In the lecture it is suggested to wait only 6 seconds before sending a probe; after some consulting with my colleagues I would recommend setting the keepalive time to 30 seconds.

Changing the values temporarily (until the next reboot)
# sysctl -w net.ipv4.tcp_keepalive_intvl=3
# sysctl -w net.ipv4.tcp_keepalive_time=30
# sysctl -w net.ipv4.tcp_keepalive_probes=3

Changing the setting permanently requires adding the parameters to /etc/sysctl.conf
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=3
net.ipv4.tcp_keepalive_time=30
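To verify that the new values are in place, query them back; assuming the values above were applied, the output should be:
# sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 3
net.ipv4.tcp_keepalive_probes = 3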

In an Openstack setup the different components are basically RabbitMQ clients, and the default configuration sets the heartbeat interval to 60 seconds. The recommendation is to set it to a value between 6 and 12 seconds, for example
heartbeat_timeout_threshold=10
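For illustration, in a Liberty-era Openstack deployment this option lives in the [oslo_messaging_rabbit] section of each service's configuration file; nova.conf is just an example here:
[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 10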

High Throughput and Concurrent Connections

TCP Buffers

Most Linux servers work well with the default parameters, but for some the defaults are not enough. For servers where high throughput is a priority, we'll need to set two parameters, net.core.rmem_max and net.core.wmem_max, to higher values. Currently, the default values are:

net.core.rmem_max = 212992
net.core.wmem_max = 212992
Which is about 208K.

The recommendation is to increase the maximum to 16M.
Note: this is only a recommendation; it depends on whether the hardware supports it, and the requirements can differ in each case.

Set the new values temporarily:
# sysctl -w net.core.wmem_max=16777216
# sysctl -w net.core.rmem_max=16777216

To set them permanently, add the parameters and values to /etc/sysctl.conf
# echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
# echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf

A large number of connections costs the OS RAM, so we would like the OS to avoid leaving connections in waiting states such as TIME_WAIT and FIN-WAIT-2 for long periods of time.

tcp_fin_timeout sets the number of seconds the OS will keep a connection in the FIN-WAIT-2 state. By reducing it from the default of 60 seconds to, say, 10 seconds, a connection spends less time in FIN-WAIT-2 and its resources are released sooner. To set a new value temporarily, until the next reboot:

# sysctl -w net.ipv4.tcp_fin_timeout=10
To set it permanently
# echo 'net.ipv4.tcp_fin_timeout = 10' >> /etc/sysctl.conf

tcp_tw_reuse allows the OS to reuse an outgoing connection that is waiting in the TIME_WAIT state; an example use case is web servers, where it enables the OS to create a lot of short-lived connections. By default tcp_tw_reuse is set to 0, meaning the OS will wait until the connection is fully closed before using the port again. It is recommended to enable it on the client side rather than the server, because it does not affect incoming connections. To enable it temporarily

# sysctl -w net.ipv4.tcp_tw_reuse=1
To set permanently
# echo 'net.ipv4.tcp_tw_reuse = 1' >> /etc/sysctl.conf

Disclaimer: tcp_tw_reuse is not safe 100% of the time; in general the TIME_WAIT state of TCP connections works well. Do not enable this parameter without considerable thought.

net.core.somaxconn is the size of the listen queue, i.e. the number of connections accepted at the same time. The default value is either 128 or 256. A bigger number will enable RabbitMQ to handle a burst of incoming connections when clients reconnect, for example after restarting a service. It is recommended to increase the number to 4096.

Set it temporarily
# sysctl -w net.core.somaxconn=4096
Or permanently
# echo 'net.core.somaxconn = 4096' >> /etc/sysctl.conf

In conclusion, the sysctl configuration file should contain the following
# cat /etc/sysctl.conf
fs.file-max = 512000
net.ipv4.tcp_keepalive_probes=3
net.ipv4.tcp_keepalive_intvl=3
net.ipv4.tcp_keepalive_time=30
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
net.core.somaxconn = 4096

RabbitMQ configuration changes

RabbitMQ's TCP buffers are set in /etc/rabbitmq/rabbitmq.config. In the configuration file there are three parameters in tcp_listen_options for us to change. The backlog is the number of inbound connections allowed at the same time; the default is 128 and it is recommended to increase it to 4096.
As for the inbound and outbound buffers, recbuf and sndbuf: according to Michael Klishin, the RAM used per connection can be reduced roughly tenfold by setting these parameters to 16384, but this will also reduce throughput, so one has to test it and find the right balance. In /etc/rabbitmq/rabbitmq.config, the tcp_listen_options should be
[
  {rabbit, [
    {tcp_listen_options, [
      {packet, raw},
      {reuseaddr, true},
      {backlog, 4096},
      {recbuf, 16384},
      {sndbuf, 16384},
      {nodelay, true},
      {exit_on_close, false},
      {keepalive, true}
    ]}
  ]}
].
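For the new listener options to take effect the broker has to be restarted; on a systemd-based system (assuming the standard rabbitmq-server unit) that would be
# systemctl restart rabbitmq-server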