Summary
Error
SLES
If you are running Vertica on SUSE Linux Enterprise Server (SLES), you may experience an issue when creating a new database. The following sections detail the versions of Vertica and SLES that this issue affects, its root cause, and the solution.
Environment
- Vertica 9.1 and higher
- SUSE Linux Enterprise Server 12 SP2 and higher
- Server with an Intel CPU that has Hardware Lock Elision (HLE) functionality, such as Haswell or Broadwell.
To check if the CPU has the HLE functionality, run the following command:
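As a minimal sketch of such a check, assuming a Linux kernel that lists CPU features on the `flags` lines of /proc/cpuinfo and reports HLE as the `hle` flag, the following helper looks for that flag. The function name `has_hle_flag` is hypothetical and is not part of Vertica or SLES.

```shell
# Sketch: report whether the CPU advertises the "hle" feature flag.
# Assumption: CPU features appear on "flags" lines of /proc/cpuinfo.
# has_hle_flag is a hypothetical helper name used only for illustration.
has_hle_flag() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    # -w matches "hle" only as a whole word, so longer flag names do not match.
    grep -E '^flags[[:space:]]*:' "$cpuinfo_file" 2>/dev/null | grep -qw 'hle'
}

if has_hle_flag; then
    echo "HLE flag present"
else
    echo "HLE flag not found"
fi
```

The optional argument lets you point the helper at a saved copy of /proc/cpuinfo from another node.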
Issue
The Vertica cluster does not start in the SLES environment. During startup, the Vertica process crashes with the following messages in vertica.log:
The following backtrace appears in ErrorReport.txt:
Cause
Some Intel CPUs do not handle Hardware Lock Elision correctly.
Solution
The Hardware Lock Elision (HLE) functionality needs to be disabled.
To disable it, add the absolute path of the directory that contains the noelision libraries to the library load path.
- Edit /etc/ld.so.conf as in the following. Add the first line to the file:
- Run the ldconfig command.
- Run the ldconfig -p | grep noel command. You should get the following result that includes "noelision" in the target path.
- Run the ldd /opt/vertica/bin/vertica | grep libpthread command. You should again get a result that includes "noelision" in the target path.
- Create the database again.
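The ldd verification in the steps above can be sketched as a small helper. It reads ldd-style output on standard input and succeeds only when the libpthread line resolves into a path containing "noelision". The function name `uses_noelision` and the variable `VERTICA_BIN` are hypothetical names chosen for this example.

```shell
# Sketch: confirm that libpthread is loaded from a "noelision" directory.
# uses_noelision is a hypothetical helper; it reads ldd output on stdin.
uses_noelision() {
    grep 'libpthread' | grep -q 'noelision'
}

# Binary path taken from the ldd command in the steps above.
VERTICA_BIN=/opt/vertica/bin/vertica

if [ -x "$VERTICA_BIN" ]; then
    if ldd "$VERTICA_BIN" | uses_noelision; then
        echo "libpthread resolves to a noelision path"
    else
        echo "libpthread does NOT resolve to a noelision path"
    fi
else
    echo "$VERTICA_BIN not found; skipping check"
fi
```

Running this on every node before creating the database again is one way to confirm that the ld.so.conf change took effect cluster-wide.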
CENTOS/RHEL
Update: Issue Resolved In Red Hat 7.x
In February 2016, we released a support notice about a glibc bug that was causing failures in Vertica databases running on Red Hat Enterprise Linux 7.x and CentOS 7.x (see the original notice below).
This update notes that Red Hat has released a fix for these issues. For more information, see the Red Hat site.
Users can upgrade glibc to the following version (or a later version) to avoid this problem:
glibc-2.17-106.el7_2.6.x86_64.rpm
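To see whether a node already meets that minimum, a rough check can compare the installed glibc version-release string with 2.17-106.el7_2.6. The sketch below assumes that `sort -V` (version sort) is available and that its ordering is close enough to RPM's own comparison for this purpose; the two are not identical, so treat the result as a hint. The helper name `glibc_at_least` is hypothetical.

```shell
# Sketch: compare an installed glibc version-release with the fixed one.
# Assumption: "sort -V" approximates RPM version ordering well enough here.
# glibc_at_least is a hypothetical helper: glibc_at_least INSTALLED REQUIRED
glibc_at_least() {
    lowest=$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)
    # INSTALLED is new enough when REQUIRED sorts first (or both are equal).
    [ "$lowest" = "$2" ]
}

required_glibc="2.17-106.el7_2.6"

if command -v rpm >/dev/null 2>&1 &&
   rpm_out=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}\n' glibc 2>/dev/null); then
    installed_glibc=$(printf '%s\n' "$rpm_out" | head -n 1)
    if glibc_at_least "$installed_glibc" "$required_glibc"; then
        echo "glibc $installed_glibc is at or above $required_glibc"
    else
        echo "glibc $installed_glibc is older than $required_glibc"
    fi
else
    echo "glibc RPM not found; skipping check"
fi
```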
To upgrade glibc, perform the following steps:
1. Restart Vertica as dbadmin:
2. Run the following command as root on all nodes:
3. Run the following command as dbadmin:
Original Notice (Published 2/2016)
If you are running Vertica on Red Hat Enterprise Linux 7.x (or CentOS 7.x), be aware that you may experience a Vertica server process failure due to a known issue in RHEL/CentOS. This article explains the root cause of the RHEL/CentOS bug, details the symptoms you’ll see if you encounter this Red Hat bug, and provides guidance on how to proceed.
Root Cause of the RHEL/CentOS Bug
The problem that causes the Vertica failure stems from the fact that an important glibc bug fix has not been applied to several distributions of RHEL 7.x and downstream distributions like CentOS 7.x.
The glibc bug fix that is missing is described here:
https://www.sourceware.org/bugzilla/show_bug.cgi?id=15073
Update: Red Hat has released a fix, available here:
https://rhn.redhat.com/errata/RHBA-2016-1030.html
The fix is not yet available on CentOS. We will publish an update as soon as this fix is available on CentOS.
Note
This issue appears in Vertica running on RHEL and CentOS 7.x distributions only. The issue does not appear with Ubuntu and Debian distributions of Linux.
What you’ll see if this problem occurs
- If this problem occurs, the Vertica server process fails, and you’ll see the following error in the <CATALOG_DIRECTORY>/dbLog file:
*** Error in `/opt/vertica/bin/vertica': invalid fastbin entry (free): 0x00007ef70f209800 ***
======= Backtrace: =========
0x7f0614f0efe1(/lib64/libc.so.6): + 0x7cfe1
0x2a1e014(/opt/vertica/bin/vertica) CAT::TabColPair_pairToBytes2(void const*, void*, unsigned long)
- In addition, you’ll notice that the vertica.log file appears as if it was truncated at an arbitrary place, sometimes in the middle of a line.
- Finally, in the core file for the failure, the following pattern appears at the top of the stack:
raise
abort
__libc_message
_int_free <==========
CAT::TabColPair_pairToBytes2(void const*, void*, unsigned long)
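The first symptom above can be checked with a simple search of the dbLog file. The sketch below is only an illustration: the helper name `has_fastbin_error` and the `DBLOG` variable are hypothetical, and you must set `DBLOG` to the dbLog file in your own catalog directory.

```shell
# Sketch: look for the glibc "invalid fastbin entry" error in a dbLog file.
# has_fastbin_error and DBLOG are hypothetical names for this example.
has_fastbin_error() {
    # -F treats the pattern as a fixed string; -q suppresses output.
    grep -qF 'invalid fastbin entry (free)' "$1" 2>/dev/null
}

# Set DBLOG to <CATALOG_DIRECTORY>/dbLog for your database before running.
DBLOG="${DBLOG:-}"

if [ -z "$DBLOG" ]; then
    echo "DBLOG is not set; skipping check"
elif has_fastbin_error "$DBLOG"; then
    echo "Found the invalid fastbin entry error in $DBLOG"
else
    echo "No invalid fastbin entry error found in $DBLOG"
fi
```

A match only indicates the first symptom; the stack pattern in the core file remains the stronger confirmation.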
How to determine whether you have the affected glibc
To determine whether the patch has been applied to your glibc, you can either:
- Run the objdump utility, or
- Examine the libc.so file manually
Run the objdump utility
- Find your libc.so file using the following command:
ldd /opt/vertica/bin/vertica | grep libc.so
libc.so.6 => /lib64/libc.so.6 (0x00007ff6dd99e000)
- Run the objdump utility as shown below to determine whether the fix has been applied:
## example of buggy libc
objdump -r -d /lib64/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | grep -A 3 cmpxchg | tail -1 | (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
7ca16: 48 85 c9 test %rcx,%rcx
Your libc is likely buggy.
## example of good libc
objdump -r -d /lib/x86_64-linux-gnu/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | grep -A 3 cmpxchg | tail -1 | (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
Your libc looks OK.
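The tail end of the one-liner above selects the third instruction after the lock cmpxchg and checks whether it references a %r register. The same logic can be sketched as a reusable function that also reports when no cmpxchg line was found, instead of silently printing an OK result. The function name `classify_int_free` is hypothetical, and the heuristic is exactly as approximate as the original pipeline.

```shell
# Sketch: classify disassembly read from stdin using the same heuristic as
# the one-liner above. classify_int_free is a hypothetical helper name.
classify_int_free() {
    # Keep the third line after the (last) cmpxchg instruction.
    checked_line=$(grep -A 3 'cmpxchg' | tail -n 1)
    if [ -z "$checked_line" ]; then
        echo "no cmpxchg found"
    elif printf '%s\n' "$checked_line" | grep -q '%r'; then
        echo "likely buggy"
    else
        echo "looks OK"
    fi
}

# Example usage (not executed here), with the libc path reported by ldd:
#   objdump -r -d /lib64/libc.so.6 | grep -C 20 _int_free \
#     | grep -C 10 cmpxchg | head -21 | classify_int_free
```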
Examine the libc.so file manually
You can also choose to examine your libc in its entirety and identify whether the fix has been applied or not. The following example contains the string ‘test %dil,%dil’. This means that the fix has been applied:
objdump -r -d /lib64/libc-2.12.so | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21
32cd8786cb: 40 20 f7 and %sil,%dil
32cd8786ce: 74 0c je 32cd8786dc <_int_free+0xec>
32cd8786d0: 4c 8b 42 08 mov 0x8(%rdx),%r8
32cd8786d4: 41 c1 e8 04 shr $0x4,%r8d
32cd8786d8: 41 83 e8 02 sub $0x2,%r8d
32cd8786dc: 48 89 53 10 mov %rdx,0x10(%rbx)
32cd8786e0: 48 89 d0 mov %rdx,%rax
32cd8786e3: 64 83 3c 25 18 00 00 cmpl $0x0,%fs:0x18
32cd8786ea: 00 00
32cd8786ec: 74 01 je 32cd8786ef <_int_free+0xff>
32cd8786ee: f0 48 0f b1 19 lock cmpxchg %rbx,(%rcx)
32cd8786f3: 48 39 c2 cmp %rax,%rdx
32cd8786f6: 75 c0 jne 32cd8786b8 <_int_free+0xc8>
32cd8786f8: 40 84 ff test %dil,%dil <==** likely good**==
32cd8786fb: 74 09 je 32cd878706 <_int_free+0x116>
32cd8786fd: 41 39 e8 cmp %ebp,%r8d
32cd878700: 0f 85 05 07 00 00 jne 32cd878e0b <_int_free+0x81b>
32cd878706: 48 83 c4 28 add $0x28,%rsp
32cd87870a: 5b pop %rbx
32cd87870b: 5d pop %rbp
32cd87870c: 41 5c pop %r12
The following example does not contain the string ‘test %dil,%dil’. This means the fix has not been applied:
objdump -r -d /lib64/libc-2.17.so | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21
7c9ec: 48 85 c9 test %rcx,%rcx
7c9ef: 74 09 je 7c9fa <_int_free+0xda>
7c9f1: 8b 41 08 mov 0x8(%rcx),%eax
7c9f4: c1 e8 04 shr $0x4,%eax
7c9f7: 8d 70 fe lea -0x2(%rax),%esi
7c9fa: 48 89 4b 10 mov %rcx,0x10(%rbx)
7c9fe: 48 89 c8 mov %rcx,%rax
7ca01: 64 83 3c 25 18 00 00 cmpl $0x0,%fs:0x18
7ca08: 00 00
7ca0a: 74 01 je 7ca0d <_int_free+0xed>
7ca0c: f0 48 0f b1 1a lock cmpxchg %rbx,(%rdx)
7ca11: 48 39 c1 cmp %rax,%rcx
7ca14: 75 ca jne 7c9e0 <_int_free+0xc0>
7ca16: 48 85 c9 test %rcx,%rcx <==**likely buggy**===
7ca19: 74 09 je 7ca24 <_int_free+0x104>
7ca1b: 44 39 e6 cmp %r12d,%esi
7ca1e: 0f 85 84 08 00 00 jne 7d2a8 <_int_free+0x988>
7ca24: 48 83 c4 48 add $0x48,%rsp
7ca28: 5b pop %rbx
7ca29: 5d pop %rbp
7ca2a: 41 5c pop %r12
What you should do
If the problem occurs and you are using Red Hat Enterprise Linux, download the glibc fix here:
https://rhn.redhat.com/errata/RHBA-2016-1030.html
If you are on CentOS, you should contact your operating system vendor and request a fix for this issue.
You can also choose to build the latest glibc 2.17 from source. Hewlett Packard Enterprise recommends testing this process in a staging area before implementing it in production. As with any major operation on your system, Hewlett Packard Enterprise recommends backing up your system before this operation.