2. Rolling Update

A rolling update keeps your cluster and all its services available on all but one Cluster Node at a time. It is performed one Cluster Node after another and requires that you stop all applications that use the eXpressWare software stack (such as a database server using SuperSockets) on the Cluster Node you intend to update. Your system therefore needs to tolerate applications going down on a single Cluster Node.

Before performing a rolling update, check the release notes of the new version to verify that it supports a rolling update from the currently installed version. If it does not, you need to perform a complete update (see previous section).

Note

It is possible to install the updated files while the applications are still using PCI Express services. However, in this case the updated PCI Express services will not become active until you restart them (or reboot the machine).

Perform the following steps on each Cluster Node:

  1. Log into the Cluster Node and become superuser (root).

  2. Build the new binary RPM packages for this Cluster Node:

    # sh  ./Dolphin_eXpressWare-<version>.sh --build-rpm

    The created binary RPM packages will be stored in the subdirectories node_RPMS and frontend_RPMS which will be created in the current working directory.

    Tip

    To save a lot of time, you can reuse the binary RPM packages built on the first updated Cluster Node on all other Cluster Nodes, provided they have the same CPU architecture and Linux version. Please see Section 2.3, “Installing from Binary RPMs” for more information.

  3. Stop all applications on this Cluster Node that use Dolphin PCI Express services.

  4. Stop all Dolphin PCI Express services on this Cluster Node using the dis_services command:

    # dis_services stop
    Stopping Dolphin SuperSockets drivers                      [  OK  ]
    Stopping Dolphin SISCI driver                              [  OK  ]
    Stopping Dolphin Node Manager                              [  OK  ]
    Stopping Dolphin IRM driver                                [  OK  ]
    Stopping Dolphin MX driver                                 [  OK  ]
    Stopping Dolphin KOSIF driver                              [  OK  ]

    If you run dis_admin, you will notice that this Cluster Node will show up as disabled (not active).

    Note

    The SIA will also try to stop all services when doing an update installation. Performing this step explicitly simply ensures that the services can be stopped and that the applications are shut down properly.

    If the services cannot be stopped for some reason, you can still update the Cluster Node, but you have to reboot it to enable the updated services. See the --reboot option in the next step.

  5. Run the SIA with the --install-node --use-rpms <path> options to install the updated RPM packages and start the updated drivers and services. The <path> parameter to the --use-rpms option has to point to the directory where the binary RPM packages were built (see step 2). If you ran the SIA in /tmp in step 2, you would issue the following command:

    # sh  Dolphin_eXpressWare-<version>.sh --install-node --use-rpms /tmp

    Adding the option --reboot will reboot the Cluster Node after the installation has completed successfully. A reboot is not required if the services were shut down successfully in step 4, but it is recommended because it allows the low-level driver to allocate sufficient memory resources for remote-memory access communication.

    Important

    If the services could not be stopped in step 4, a reboot is required to allow the updated drivers to be loaded. Otherwise, the new drivers will only be installed on disk, but will not be loaded and used.

    If for some reason you want to re-install the same version, or even an older version of the Dolphin PCI Express software stack than is currently installed, you need to use the --enforce option.

  6. The updated services will be started by the installation and are then available for use by the applications. Make sure that the Cluster Node has shown up as active (green) in dis_admin again before updating the next Cluster Node.

    If the services fail to start, rebooting the Cluster Node will fix the problem. Such a failure can occur when the memory is too fragmented for the low-level driver (see above).
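
Steps 4 and 5 can be combined in a small wrapper script for one Cluster Node. The following is only a sketch under stated assumptions, not part of the eXpressWare distribution. The variable names SIA, RPM_DIR and DRY_RUN and the functions run, sia_options and update_node are hypothetical. The sketch also assumes that dis_services stop exits with a non-zero status when a service cannot be stopped, which this guide does not state. In that case the --reboot option is added, as required above. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of steps 4 and 5 for a single Cluster Node; run as root on that node.
# SIA, RPM_DIR and DRY_RUN are illustrative names, not eXpressWare settings.

SIA="${SIA:-./Dolphin_eXpressWare-<version>.sh}"  # replace <version> with your release
RPM_DIR="${RPM_DIR:-/tmp}"                        # directory where the RPMs were built (step 2)
DRY_RUN="${DRY_RUN:-1}"                           # 1 = only print commands, 0 = execute them

# Execute a command, or only print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# Print the SIA options for step 5.
# $1 is the exit status of "dis_services stop" (assumed non-zero on failure).
sia_options() {
    opts="--install-node --use-rpms $RPM_DIR"
    if [ "$1" -ne 0 ]; then
        # Services are still loaded: a reboot is needed to activate the new drivers.
        opts="$opts --reboot"
    fi
    printf '%s\n' "$opts"
}

# Stop the services (step 4), then install the updated packages (step 5).
update_node() {
    stop_status=0
    run dis_services stop || stop_status=$?
    # Unquoted on purpose so the options split into words;
    # this assumes RPM_DIR contains no whitespace.
    run sh "$SIA" $(sia_options "$stop_status")
}

update_node
```

To perform the update for real, set DRY_RUN=0 and point SIA at the actual installer file. The check in step 6 (the Cluster Node showing up as active in dis_admin) is deliberately left as a manual step before you move on to the next Cluster Node.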