4. Troubleshooting

If the applications you are using do not work or do not show increased performance, please follow this troubleshooting guide carefully.

If the application is using SuperSockets:

  1. To verify that the preloading works, use the ldd command on any executable, e.g. the netperf binary mentioned above:

    $ export LD_PRELOAD=libksupersockets.so
    $ ldd netperf
            libksupersockets.so => /opt/DIS/lib64/libksupersockets.so (0x0000002a95577000)
            libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x00000033ed300000)
            libc.so.6 => /lib64/tls/libc.so.6 (0x00000033ec800000)
            libdl.so.2 => /lib64/libdl.so.2 (0x00000033ecb00000)
            /lib64/ld-linux-x86-64.so.2 (0x00000033ec600000)
    

    The library libksupersockets.so has to be listed at the top position. If this is not the case, make sure the library file actually exists. The default locations are /opt/DIS/lib/libksupersockets.so and, on 64-bit platforms, /opt/DIS/lib64/libksupersockets.so. Note that libksupersockets.so is actually a symbolic link to a library with the same name and a version suffix:

    $ ls -lR /opt/DIS/lib*/*ksupersockets*
    -rw-r--r--  1 root root 29498 Nov 14 12:43 /opt/DIS/lib64/libksupersockets.a
    -rw-r--r--  1 root root   901 Nov 14 12:43 /opt/DIS/lib64/libksupersockets.la
    lrwxrwxrwx  1 root root    25 Nov 14 12:50 /opt/DIS/lib64/libksupersockets.so -> 
                                               libksupersockets.so.3.3.0
    lrwxrwxrwx  1 root root    25 Nov 14 12:50 /opt/DIS/lib64/libksupersockets.so.3 -> 
                                               libksupersockets.so.3.3.0
    -rw-r--r--  1 root root 65160 Nov 14 12:43 /opt/DIS/lib64/libksupersockets.so.3.3.0
    -rw-r--r--  1 root root 19746 Nov 14 12:43 /opt/DIS/lib/libksupersockets.a
    -rw-r--r--  1 root root   899 Nov 14 12:43 /opt/DIS/lib/libksupersockets.la
    lrwxrwxrwx  1 root root    25 Nov 14 12:50 /opt/DIS/lib/libksupersockets.so -> 
                                               libksupersockets.so.3.3.0
    lrwxrwxrwx  1 root root    25 Nov 14 12:50 /opt/DIS/lib/libksupersockets.so.3 -> 
                                               libksupersockets.so.3.3.0
    -rw-r--r--  1 root root 48731 Nov 14 12:43 /opt/DIS/lib/libksupersockets.so.3.3.0
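
    The position check on the ldd output can also be scripted. The following is a minimal sketch, not part of the Dolphin tools: preload_is_first is a hypothetical helper name, and the sketch assumes that ldd prints one library per line as in the example above. On newer Linux systems ldd usually lists the kernel's linux-vdso (or linux-gate) pseudo-library first, so the sketch skips that entry.

```shell
# Succeeds if the first real library in ldd output (read from stdin)
# is libksupersockets.so. Hypothetical helper, for illustration only.
preload_is_first() {
    awk '
        # Skip blank lines and the kernel-provided vDSO pseudo-library.
        NF == 0 || $1 ~ /^linux-(vdso|gate)\.so/ { next }
        # Remember the first remaining library name and stop reading.
        { first = $1; exit }
        END { exit (first ~ /^libksupersockets\.so/ ? 0 : 1) }
    '
}

# Example use:
#   LD_PRELOAD=libksupersockets.so ldd netperf | preload_is_first \
#       && echo "preload OK" || echo "preload NOT effective"
```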
    

    Also, make sure that the dynamic linker is configured to find the library in this location. The dynamic linker is configured accordingly when the RPM is installed; if you did not install via RPM, you need to configure the dynamic linker manually. To verify that dynamic linking is the problem, set LD_LIBRARY_PATH to include the path to libksupersockets.so and verify again with ldd:

    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/DIS/lib:/opt/DIS/lib64
    $ echo $LD_PRELOAD
    libksupersockets.so
    $ ldd netperf
    ....

    A better solution than setting LD_LIBRARY_PATH is to configure the dynamic linker (ld.so) to include these directories in its search path. Use man ldconfig to learn how to achieve this.
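
    As a rough sketch of the manual linker configuration: on many Linux distributions, ldconfig reads additional library directories from files in /etc/ld.so.conf.d/. This is an assumption about your system, so check man ldconfig first. The helper name write_dis_ld_conf and the file name dis.conf are hypothetical choices, not something the Dolphin installer is known to use.

```shell
# Write the two default SuperSockets library directories, one per line,
# to the configuration file given as the first argument. Taking the path
# as an argument allows a dry run on a scratch file.
write_dis_ld_conf() {
    printf '%s\n' /opt/DIS/lib /opt/DIS/lib64 > "$1"
}

# Intended use, as root (the file name is a hypothetical choice):
#   write_dis_ld_conf /etc/ld.so.conf.d/dis.conf
#   ldconfig        # rebuild the dynamic linker cache
```

    Afterwards, unset LD_LIBRARY_PATH and repeat the ldd check to confirm that the library is found without it.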

  2. You need to make sure that the preloading of the SuperSockets library described above is effective on both Cluster Nodes, for both applications that should communicate via SuperSockets.
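
    For a process that is already running, one way to check this is to look for the library in the memory map of the process. The sketch below assumes a Linux system with /proc mounted; maps_has_supersockets is a hypothetical helper name.

```shell
# Succeeds if the memory-map listing in the given file mentions the
# SuperSockets preload library.
maps_has_supersockets() {
    grep -q 'libksupersockets\.so' "$1"
}

# Example use for a running process with process ID $pid:
#   maps_has_supersockets "/proc/$pid/maps" \
#       && echo "SuperSockets library is loaded" \
#       || echo "SuperSockets library is NOT loaded"
```

    Run the check on both Cluster Nodes, once for each of the communicating applications.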

  3. Make sure that the SuperSockets kernel module and the kernel modules it depends on are loaded and configured correctly on both Cluster Nodes.

    1. Check the status of all Dolphin kernel modules via the dis_services script (default location /opt/DIS/sbin):

      # dis_services status
      Dolphin kOSIF 5.5.0 is running
      Dolphin PX 5.5.0 is running
      Dolphin IRM 5.5.0 (  January 10th 2018 ) is running.
      Dolphin Node Manager is running (pid 3172).
      Dolphin SISCI 5.5.0 (  January 10th 2018 ) is running.
      Dolphin SuperSockets 5.5.0 "Express Train", January 10th 2018 (built January 10th 
      2018) running.
      

      At least the services dis_irm and dis_supersockets need to be running, and you should not see a message about SuperSockets not being configured.
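
      The status check can be automated along the following lines. This is a minimal sketch that assumes the status lines look like the example above (service name after "Dolphin", the word "running" at the end). The exact wording for a stopped service is not shown in this guide, so treating "not running" as a failure is an assumption. check_dis_services is a hypothetical helper name.

```shell
# Reads "dis_services status" output from stdin and succeeds only if
# both the IRM and the SuperSockets service are reported as running.
check_dis_services() {
    awk '
        # Evaluate one complete status record.
        function check(r) {
            if (r ~ /running/ && r !~ /not running/) {
                if (r ~ /Dolphin IRM /) irm = 1
                if (r ~ /Dolphin SuperSockets /) ssocks = 1
            }
        }
        # A line starting with "Dolphin" begins a new record ...
        /^[[:space:]]*Dolphin / { check(rec); rec = $0; next }
        # ... any other line is a wrapped continuation of the current one.
        { rec = rec " " $0 }
        END {
            check(rec)
            if (!irm) print "Dolphin IRM not reported as running"
            if (!ssocks) print "Dolphin SuperSockets not reported as running"
            exit ((irm && ssocks) ? 0 : 1)
        }
    '
}

# Example use:
#   /opt/DIS/sbin/dis_services status | check_dis_services
```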

    2. Verify that SuperSockets have the correct view of the PCI Express adapters within the cluster. Call dis_ssocks_adm with the option -n:

      root@d0 # dis_ssocks_adm -n
      Local   node ID list
      -------------------------------------
        X        4    0 
                 8    0 
                12    0 
                16    0
      

      Running this command on all Cluster Nodes should give identical output apart from the marker X which indicates the current Cluster Node.

      If this is not the case, the affected Cluster Node has an invalid /etc/dis/dishosts.conf. Make sure that the dis_nodemgr service is up on this Cluster Node, and that the Cluster Node is shown active in the dis_admin GUI.
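
      To compare the output of several Cluster Nodes without the marker getting in the way, the listing can be normalized before it is compared. The sketch below assumes the layout shown above, with the X marker in the first column; normalize_node_list is a hypothetical helper name.

```shell
# Reads "dis_ssocks_adm -n" output from stdin, removes the leading X
# marker, drops blank lines and collapses runs of whitespace, so that
# the listings of different Cluster Nodes can be compared directly.
normalize_node_list() {
    sed 's/^[[:space:]]*X[[:space:]]/ /' | awk 'NF { $1 = $1; print }'
}

# Example use, with listings saved from two Cluster Nodes
# (the file names are hypothetical):
#   normalize_node_list < node_a.txt > a.norm
#   normalize_node_list < node_b.txt > b.norm
#   cmp -s a.norm b.norm && echo "identical" || echo "DIFFERENT"
```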

    3. Verify the SuperSockets configuration to make sure that all Cluster Nodes will connect and communicate via SuperSockets using the right IP addresses. The active configuration can be retrieved via dis_ssocks_adm -m:

      # dis_ssocks_adm -m
      IP/net             Adapter    NodeId List
      -----------------------------------------------
      172.16.5.1/32      0x0000        4    0    0
      172.16.5.2/32      0x0000        8    0    0
      172.16.5.3/32      0x0000       68    0    0
      172.16.5.4/32      0x0000       72    0    0

      Depending on the configuration variant you used to set up SuperSockets, this output may look different, but it must never be empty and should be identical on all Cluster Nodes. The example above shows a four-node cluster with a single fabric and a static SuperSockets configuration, which will accelerate one socket interface per Cluster Node.

      For more information on the configuration of SuperSockets, please refer to Section 1.1, “dishosts.conf”.

    4. Make sure that the host names/IP addresses used effectively by the application are the ones that are configured for SuperSockets, especially if the Cluster Nodes have multiple Ethernet interfaces configured.
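
      Whether a particular address is covered can be checked against the active configuration. The sketch below only handles host entries of the form address/32, as in the static example above; configurations that list whole networks are not covered by it. ip_is_accelerated is a hypothetical helper name.

```shell
# Reads "dis_ssocks_adm -m" output from stdin and succeeds if the given
# IP address appears as a host entry (address/32) in the first column.
ip_is_accelerated() {
    awk -v want="$1/32" '$1 == want { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example use:
#   dis_ssocks_adm -m | ip_is_accelerated 172.16.5.2 \
#       && echo "configured for SuperSockets" \
#       || echo "NOT configured for SuperSockets"
```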

  4. SuperSockets provide an internal event log, which can be accessed via dis_ssocks_diag. To attach to the event log and get all events printed to the terminal as they occur, use dis_ssocks_diag -Ev. If you then run the application, you will see all connection attempts and their results.

    A successful connection attempt of a client towards a server via the PCI Express interconnect will look like this:

    [Jul 14 14:08:36] TRACE: new SuperSocket created
                      local:0.0.0.0:0 peer:0.0.0.0:0 pid:3293 obj:0x0xffff880259440800
    [Jul 14 14:08:36] TRACE: SuperSockets connection established
                      local:172.16.6.15:35394 peer:172.16.6.16:5432 pid:3293 
                      obj:0x0xffff880259440800
    [Jul 14 14:08:37] TRACE: releasing stream socket
                      local:172.16.6.15:35394 peer:172.16.6.16:5432 pid:3293 
                      obj:0x0xffff880259440800
    	    

    The server will report the accepted SuperSockets connection like this:

    [Jul 14 14:10:35] TRACE: native accept succeeded
                      local:0.0.0.0:5432 peer:172.16.6.15:55215 pid:21472 
                      obj:0x0xffff880257454800
    [Jul 14 14:10:35] TRACE: SuperSockets connection accepted
                      local:172.16.6.16:5432 peer:172.16.6.15:55215 pid:21472 
                      obj:0x0xffff880257454c00
    [Jul 14 14:10:35] TRACE: releasing stream socket
                      local:0.0.0.0:5432 peer:0.0.0.0:0 pid:21472 
                      obj:0x0xffff880257454800
    [Jul 14 14:10:36] TRACE: releasing stream socket
                      local:172.16.6.16:5432 peer:172.16.6.15:55215 pid:21472 
                      obj:0x0xffff880257454c00
              

    A client's connection towards a server that (the client thinks) is not configured to use SuperSockets is performed via Ethernet and reported as follows:

    [Jul 14 14:11:16] TRACE: new SuperSocket created
                      local:0.0.0.0:0 peer:0.0.0.0:0 pid:3320 obj:0x0xffff880259440000
    [Jul 14 14:11:16] WARN: admin msg SYN_CLIENT failed err:0x6f
                      local:172.16.6.15:35652 peer:172.16.6.16:5432 pid:3320 
                      obj:0x0xffff880259440000
    [Jul 14 14:11:16] WARN: fallback connection established
                      local:172.16.6.15:35652 peer:172.16.6.16:5432 pid:3320 
                      obj:0x0xffff880259440000
    [Jul 14 14:11:29] TRACE: releasing stream socket
                      local:172.16.6.15:35652 peer:172.16.6.16:5432 pid:3320 
                      obj:0x0xffff880259440000 
              

    If a client tries to connect via SuperSockets but fails to do so, it falls back to Ethernet by default. This fallback capability can be disabled to ensure that SuperSockets, and nothing else, are used whenever they are supposed to be used. The event log will look like this:

    [Jul 14 14:12:24] TRACE: new SuperSocket created
                      local:0.0.0.0:0 peer:0.0.0.0:0 pid:21491 obj:0x0xffff88025736f800
    [Jul 14 14:12:29] TRACE: native accept succeeded
                      local:0.0.0.0:5432 peer:172.16.6.15:47158 pid:21491 
                      obj:0x0xffff88025736f800
    [Jul 14 14:12:29] WARN: fallback connection accepted
                      local:172.16.6.16:5432 peer:172.16.6.15:47158 pid:21491 
                      obj:0x0xffff880256f94000
    [Jul 14 14:12:29] TRACE: releasing stream socket
                      local:0.0.0.0:5432 peer:0.0.0.0:0 pid:21491 
                      obj:0x0xffff88025736f800
    [Jul 14 14:12:42] TRACE: releasing stream socket
                      local:172.16.6.16:5432 peer:172.16.6.15:47158 pid:21491 
                      obj:0x0xffff880256f94000 
              

    The server may not report this event at all, as it may not have been notified of it.

    For an explanation of typical error messages, please refer to Section 2, “Software”.
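
    When an application opens many connections, reading the event log line by line becomes tedious. As a rough sketch, the following helper counts how many connections in a captured log went over SuperSockets and how many fell back to Ethernet. It relies on the message texts shown in the examples above; summarize_ssocks_log is a hypothetical helper name, not part of the Dolphin tools.

```shell
# Reads captured "dis_ssocks_diag -Ev" output from stdin and prints the
# number of SuperSockets connections and of fallback connections.
summarize_ssocks_log() {
    awk '
        /SuperSockets connection (established|accepted)/ { ss++ }
        /fallback connection (established|accepted)/     { fb++ }
        END { printf "supersockets=%d fallback=%d\n", ss, fb }
    '
}

# Example use, assuming the event log was saved to a file
# (ssocks.log is a hypothetical name):
#   summarize_ssocks_log < ssocks.log
```

    A non-zero fallback count indicates that at least one connection used Ethernet instead of the PCI Express interconnect.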

  5. Don't forget to check whether the port numbers used by this application, or the application itself, have been explicitly excluded from using SuperSockets. By default, only the system port numbers below 1024 are excluded from using SuperSockets, but you should verify the current configuration using dis_ssocks_adm -p (see Section 2, “SuperSockets Configuration”).

  6. If you can't solve the problem, please contact Dolphin Support. When doing so, please attach

    • the output of dis_status

    • the output of dis_ssocks_diag -Ev for the connection attempts.