1. Verifying Functionality and Performance

The following sections describe how to verify that the interconnect is set up correctly, which means that all Cluster Nodes can communicate with all other Cluster Nodes via the PCI Express interconnect by sending low-level control packets and performing remote memory access.

1.1. Availability of Drivers and Services

The Cluster Management Node functionality is optional for SISCI-based applications but currently mandatory for SuperSockets operation. The Cluster Management Node automatically distributes configuration files and simplifies diagnostics of the cluster. On the Cluster Management Node, only the user-space service dis_networkmgr (the central Network Manager) needs to be running.

Without the required drivers and services running on a Cluster Node, the node will fail to communicate with other nodes. On the Cluster Nodes, the kernel services dis_px_irm (interconnect resource driver) and dis_px_sisci (upper-level hardware services) need to be running. In addition to these kernel drivers, the user-space service dis_nodemgr (the node manager, which talks to the central Network Manager) needs to be active for configuration and monitoring.

Because the drivers also appear as services, you can query their status with the usual tools of the installed operating system distribution.

C:\>sc.exe query dis_px_irm

SERVICE_NAME: dis_px_irm
        TYPE               : 1  KERNEL_DRIVER
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
        SERVICE_EXIT_CODE  : 0  (0x0)
        CHECKPOINT         : 0x0
        WAIT_HINT          : 0x0
        

If any of the required services is not running, you will find more information on the problem in the system log facilities. Open Control Panel -> Administrative Tools -> Event Viewer to inspect the kernel messages, and check %WinDir%\system32\drivers\etc\dis\log for related messages. To get the installation logs, run each MSI with the /l*xv log_install.txt switch.
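As an illustration of how this check could be automated across many Cluster Nodes, the following Python sketch parses the STATE line of sc.exe query output. The SAMPLE text is adapted from the excerpt above; capturing live output (e.g. via subprocess) is only indicated in the comment and not part of this sketch.

```python
# Sketch: check that a Windows service reports STATE 4 (RUNNING) by parsing
# "sc.exe query <name>" output. In practice you would capture the text with
# subprocess.run(["sc.exe", "query", name], capture_output=True, text=True).
import re

SAMPLE = """\
SERVICE_NAME: dis_px_irm
        TYPE               : 1  KERNEL_DRIVER
        STATE              : 4  RUNNING
                                (STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
        WIN32_EXIT_CODE    : 0  (0x0)
"""

def is_running(sc_output: str) -> bool:
    """Return True if the STATE line reports state code 4 (RUNNING)."""
    m = re.search(r"^\s*STATE\s*:\s*(\d+)", sc_output, re.MULTILINE)
    return bool(m) and m.group(1) == "4"

if __name__ == "__main__":
    # The two kernel services required on every Cluster Node:
    for service in ("dis_px_irm", "dis_px_sisci"):
        print(service, "OK" if is_running(SAMPLE) else "NOT RUNNING")
```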

1.2. PCIe Connection Test

To ensure that the cluster is cabled correctly, please perform the PCIe connection test as described in Chapter 4, Initial Installation, Section 3.7.4, “PCIe Connection Test”.

1.3. Static PCIe Interconnect Test - dis_diag

The static interconnect test makes sure that all PCIe communication hardware is working correctly by performing a self-test, and determines whether the setup and the PCIe routing are correct (i.e. match the actual hardware topology). It also checks all PCIe connections, but this has already been covered by the PCIe Connection Test. The tool to perform this test is dis_diag (default location c:\Program Files\Dolphin Express PX\Util\dis_diag).

Running dis_diag on a Cluster Node performs a self-test on the local adapter(s) and lists all remote adapters that these adapters can see via the PCI Express interconnect. To perform the static interconnect test on a full cluster, run dis_diag on each Cluster Node and verify that no problems with the adapters are reported, and that the adapters in each Cluster Node can see all remote adapters installed in the other Cluster Nodes.

Normally you should invoke dis_diag with no arguments; it will then perform a general test and show only the most relevant information. Advanced users may want to enable full verbose mode by using the -V 9 command line option:

	  dis_diag -V 9	
	

The -V 9 option generates a lot of information; some of it requires knowledge of the PCIe chipset and the PCIe specification in general. The diagnostic module collects various usage and error information over time. This information can be cleared by using the -clear command line option:

	  dis_diag -clear	
	

An example output of dis_diag for a Cluster Node that is part of a 2-node cluster, with one PXH830 adapter per Cluster Node, looks like this:

[root@Hetty ~]# /opt/DIS/sbin/dis_diag 
================================================================================
  Dolphin diagnostic tool --  dis_diag version 5.5.0 ( Thu Jan 10th 16:23:13 CET 2018 )
================================================================================

dis_diag compiled in 64 bit mode
Driver : Dolphin IRM (GX) 5.5.0.0 Jan 10th 2018 (rev unknown)
Date   : Wed Feb  1 14:34:26 CET 2018
System : Linux Hetty 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 
         x86_64 x86_64 x86_64 GNU/Linux

Number of configured local adapters found: 1

Adapter 0 > Type                       : PXH830
            NodeId                     : 4
            Serial number              : PXH830-BC-000116
            PXH chip family            : PLX_DRACO_2  
            PXH chip vendorId          : 0x10b5  
            PXH chip device            : 0x8733  
            PXH chip revision          : 0xCA 
            EEPROM version NTB mode    : 05  
            EEPROM vendor info         : 0x0000  
            Firmware version           : 05.01
            Card revision              : BC
            Topology type              : Direct 2 nodes
            Topology Autodetect        : No
            Number of enabled links    : 1
            Max payload size (MPS)     : 256 
            Multicast group size       : 2 MB 
            Prefetchable memory size   : 32768 MB (BAR2)
            Non-prefetchable size      : 64 MB (BAR4)
            Clock mode slot            : Port 
            Clock mode link            : Global 
            PCIe slot state            : x16, Gen3 (8 GT/s) 
            PCIe slot capabilities     : x16, Gen3 (8 GT/s) 

*************************  PXH ADAPTER 0 LINK 0 STATE  *************************

            Link 0 uptime              : 91371 seconds
            Link 0 state               : ENABLED
            Link 0 state               : x16, Gen3 (8 GT/s) 
            Link 0 required            : x16, Gen3 (8 GT/s)
            Link 0 capabilities        : x16, Gen3 (8 GT/s)
            Link 0 cable inserted      : 1 
            Link 0 active              : 1 
            Link 0 configuration       : NTB

****************************  PXH ADAPTER 0 STATUS  ****************************

            Chip temperature           : 87 C 
            Board temperature          : 50 C 

***************  PXH ADAPTER 0, PARTNER INFORMATION FOR LINK 0  ***************
 
            Partner adapter type       : PXH830
            Partner serial number      : PXH830-000117
            Partner link no            : 0
            Partner number of ports    : 1
 
*****************************  TEST OF ADAPTER 0  *****************************

OK: PXH chip alive in adapter 0.
OK: Link alive in adapter 0.
==> Local adapter 0 ok.

************************  TOPOLOGY SEEN FROM ADAPTER 0  ************************

Adapters found: 2
----- List of all nodes found:

Nodes detected:   0004  0008 

***********************  SESSION STATUS FROM ADAPTER 0  ***********************

Node 4: Session valid
Node 8: Session valid

----------------------------------
dis_diag discovered 0 note(s).
dis_diag discovered 0 warning(s).
dis_diag discovered 0 error(s).
TEST RESULT: *PASSED*
	

The static interconnect test passes if dis_diag reports TEST RESULT: *PASSED* and the same topology (remote adapters) on all Cluster Nodes.
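As an illustration, this cluster-wide check can be automated by parsing the dis_diag output collected from each Cluster Node. The sketch below is not part of the Dolphin tooling; DIAG_OUTPUT is a trimmed sample in the format shown above.

```python
# Sketch: extract the detected node list and overall result from dis_diag
# output, so that outputs collected from all Cluster Nodes can be compared.
import re

DIAG_OUTPUT = """\
Nodes detected:   0004  0008
dis_diag discovered 0 error(s).
TEST RESULT: *PASSED*
"""

def parse_diag(text):
    """Return (sorted list of detected NodeIds, True if the test passed)."""
    nodes = []
    m = re.search(r"Nodes detected:\s*((?:\d+\s*)+)", text)
    if m:
        nodes = sorted(int(n) for n in m.group(1).split())
    return nodes, "TEST RESULT: *PASSED*" in text

def topology_consistent(outputs):
    """The static test passes cluster-wide if every node PASSED and all
    nodes report the same set of adapters."""
    parsed = [parse_diag(o) for o in outputs]
    reference_nodes = parsed[0][0]
    return all(p == (reference_nodes, True) for p in parsed)
```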

1.4. Interconnect Load Test

While the static interconnect test sends only a few packets over the links to probe remote nodes, the Interconnect Load Test puts significant stress on the interconnect and observes whether any data transmissions have to be retried due to link errors. This can happen if cables are not connected correctly, e.g. plugged in without the connector latches locking. Before running this test, make sure your cluster is connected and configured correctly by running the tests described in the previous sections.

1.4.1. Test Execution from Dolphin dis_admin GUI

This test can be performed from within the Dolphin dis_admin GUI tool. Please refer to Appendix A, dis_admin Reference for details.

1.4.2. Test Execution from Command Line

To run this test from the command line, simply invoke sciconntest (default location c:\Program Files\Dolphin Express PX\Demo\sciconntest) on all Cluster Nodes.

Note

It is recommended to run this test from the dis_admin GUI (see the previous section) because it performs a more controlled variant of the test and gives more helpful results.

All instances of sciconntest will connect and start to exchange data, which can take up to 30 seconds. The output of sciconntest on one Cluster Node in a 9-node cluster looks like this:

/opt/DIS/bin/sciconntest compiled Oct  2 2007 : 22:29:09
 ----------------------------
 Local node-id      : 76
 Local adapter no.  : 0
 Segment size       : 8192
 MinSize            : 4
 Time to run (sec)  : 10
 Idelay             : 0
 No Write           : 0
 Loopdelay          : 0
 Delay              : 0
 Bad                : 0
 Check              : 0
 Mcheck             : 0
 Max nodes          : 256
 rnl                : 0
 Callbacks          : Yes
 ----------------------------
 Probing all nodes
 Response from remote node 4
 Response from remote node 8
 Response from remote node 12
 Response from remote node 68
 Response from remote node 72
 Response from remote node 132
 Response from remote node 136
 Response from remote node 140
 Local segment (id=4, size=8192) is created.
 Local segment (id=4, size=8192) is shared.
 Local segment (id=8, size=8192) is created.
 Local segment (id=8, size=8192) is shared.
 Local segment (id=12, size=8192) is created.
 Local segment (id=12, size=8192) is shared.
 Local segment (id=68, size=8192) is created.
 Local segment (id=68, size=8192) is shared.
 Local segment (id=72, size=8192) is created.
 Local segment (id=72, size=8192) is shared.
 Local segment (id=132, size=8192) is created.
 Local segment (id=132, size=8192) is shared.
 Local segment (id=136, size=8192) is created.
 Local segment (id=136, size=8192) is shared.
 Local segment (id=140, size=8192) is created.
 Local segment (id=140, size=8192) is shared.
 Connecting to 8 nodes
 Connect to remote segment, node 4
 Remote segment on node 4 is connected.
 Connect to remote segment, node 8
 Remote segment on node 8 is connected.
 Connect to remote segment, node 12
 Remote segment on node 12 is connected.
 Connect to remote segment, node 68
 Remote segment on node 68 is connected.
 Connect to remote segment, node 72
 Remote segment on node 72 is connected.
 Connect to remote segment, node 132
 Remote segment on node 132 is connected.
 Connect to remote segment, node 136
 Remote segment on node 136 is connected.
 Connect to remote segment, node 140
 Remote segment on node 140 is connected.
 SCICONNTEST_REPORT
 NUM_TESTLOOPS_EXECUTED    1
 NUM_NODES_FOUND           8
 NUM_ERRORS_DETECTED       0
 node 4 : Found
 node 4 : Number of failures : 0
 node 4 : Longest failure    :    0.00 (ms)
 node 8 : Found
 node 8 : Number of failures : 0
 node 8 : Longest failure    :    0.00 (ms)
 node 12 : Found
 node 12 : Number of failures : 0
 node 12 : Longest failure    :    0.00 (ms)
 node 68 : Found
 node 68 : Number of failures : 0
 node 68 : Longest failure    :    0.00 (ms)
 node 72 : Found
 node 72 : Number of failures : 0
 node 72 : Longest failure    :    0.00 (ms)
 node 132 : Found
 node 132 : Number of failures : 0
 node 132 : Longest failure    :    0.00 (ms)
 node 136 : Found
 node 136 : Number of failures : 0
 node 136 : Longest failure    :    0.00 (ms)
 node 140 : Found
 node 140 : Number of failures : 0
 node 140 : Longest failure    :    0.00 (ms)
 SCICONNTEST_REPORT_END

 SCI_CB_DISCONNECT:Segment removed on the other node disconnecting.....

The test passes if all Cluster Nodes report 0 failures for all remote Cluster Nodes. If the test identifies failures, determine the pair(s) of Cluster Nodes between which these failures are reported and check the cabled connection between them. The numerical NodeIds shown in this output are the Cluster NodeIds of the adapters (which identify an adapter in the PCI Express interconnect).
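As an illustration, the per-node failure counts can be extracted from the SCICONNTEST_REPORT section with a short Python sketch. The sample report below is abbreviated, and the non-zero failure count for node 8 is hypothetical, added only to show how a problem would be flagged.

```python
# Sketch: map each remote NodeId to its failure count and report only the
# nodes whose cabled connections should be checked.
import re

REPORT = """\
SCICONNTEST_REPORT
NUM_TESTLOOPS_EXECUTED    1
NUM_NODES_FOUND           8
NUM_ERRORS_DETECTED       3
node 4 : Found
node 4 : Number of failures : 0
node 8 : Found
node 8 : Number of failures : 3
SCICONNTEST_REPORT_END
"""

def failing_nodes(report: str) -> dict:
    """Return {NodeId: failure count} for every node with failures > 0."""
    fails = {}
    for node, count in re.findall(
            r"node (\d+) : Number of failures : (\d+)", report):
        if int(count) > 0:
            fails[int(node)] = int(count)
    return fails

if __name__ == "__main__":
    # Non-empty result => check the cables towards the listed nodes.
    print(failing_nodes(REPORT))
```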

Although this test can be run while a system is in production, you have to take into account that the performance of production applications will be reduced significantly while the test is running. Furthermore, if links actually show problems, they might be temporarily disabled, stopping all communication until rerouting takes place.

1.5. Interconnect Performance Test

Once the correct installation, setup and basic functionality of the interconnect have been verified, you can run a set of low-level benchmarks to determine the baseline performance of the interconnect without any additional software layers. The relevant tests are scibench2 (streaming remote memory PIO access performance), scipp (request-response remote memory PIO write performance), dma_bench (streaming remote memory DMA access performance) and intr_bench (remote interrupt performance).

All these tests need to run on two Cluster Nodes (A and B) and are started in the same manner:

  1. Determine the NodeId of both Cluster Nodes using the query command (default path C:\Program Files\Dolphin Express PX\Examples\bin\query). The NodeId is reported as "Local node-id".

  2. On node A, start the server-side benchmark with the options -server and -rn <NodeId of B>, like:

    $ scibench2 -server -rn 8
  3. On Cluster Node B, start the client-side benchmark with the options -client and -rn <NodeId of A>, like:

    $ scibench2 -client -rn 4
  4. The test results are reported by the client.
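The server/client pairing above is easy to get wrong (each side names the *other* node's NodeId). As a purely illustrative sketch, a small helper can build the matching command lines; the NodeIds 4 and 8 are the example values used in this chapter, and the helper itself is not part of the Dolphin tools.

```python
# Sketch: build the matching server and client command lines for the
# low-level benchmarks from the two NodeIds reported by the query tool.
def benchmark_commands(tool: str, nodeid_a: int, nodeid_b: int):
    """Return (command to run on node A, command to run on node B)."""
    server = f"{tool} -server -rn {nodeid_b}"  # node A names B as remote
    client = f"{tool} -client -rn {nodeid_a}"  # node B names A as remote
    return server, client

if __name__ == "__main__":
    for tool in ("scibench2", "scipp", "dma_bench", "intr_bench"):
        print(benchmark_commands(tool, 4, 8))
```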

scibench2

scibench2 measures the streaming bandwidth using CPU-based PIO transfers (memcpy).

The following results were measured using a PXH810 card (Gen3, x8):

---------------------------------------------------------------
Segment Size:    Average Send Latency:        Throughput:
---------------------------------------------------------------
      4              0.07 us                58.31 MBytes/s
      8              0.07 us               117.14 MBytes/s
     16              0.07 us               231.06 MBytes/s
     32              0.07 us               445.08 MBytes/s
     64              0.08 us               838.84 MBytes/s
    128              0.09 us              1483.27 MBytes/s
    256              0.11 us              2408.40 MBytes/s
    512              0.15 us              3497.44 MBytes/s
   1024              0.23 us              4530.20 MBytes/s
   2048              0.39 us              5294.99 MBytes/s
   4096              0.77 us              5308.03 MBytes/s
   8192              1.54 us              5306.65 MBytes/s
  16384              3.10 us              5291.49 MBytes/s
  32768              6.19 us              5294.48 MBytes/s
  65536             12.39 us              5289.90 MBytes/s
	      

Average Send Latency is the wall-clock time to write 4 bytes to remote memory.

Throughput is the streaming performance using PIO writes to remote memory.
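The two columns of the scibench2 table are directly related: throughput is approximately the segment size divided by the average send latency (a byte per microsecond is a megabyte per second). The quick check below verifies this against a few rows of the table above; the tolerance only accounts for the rounding in the printed latencies.

```python
# Consistency check: throughput ~= segment size / latency for scibench2.
rows = [  # (segment size in bytes, latency in us, throughput in MBytes/s)
    (4, 0.07, 58.31),
    (1024, 0.23, 4530.20),
    (65536, 12.39, 5289.90),
]
for size, lat_us, mbps in rows:
    derived = size / lat_us  # bytes per microsecond == MBytes/s
    assert abs(derived - mbps) / mbps < 0.06, (size, derived, mbps)
print("scibench2 table is self-consistent")
```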

dma_bench

dma_bench measures the streaming DMA bandwidth available through the SISCI API.

The following results were measured using a PXH830 card (Gen3, x16):

-------------------------------------------------------------------------------
  Message    Total     Vector     Transfer        Latency         Bandwidth
   size      size      length       time        per message       
-------------------------------------------------------------------------------
     64      16384      256        35.76 us       0.14 us       458.18 MBytes/s
    128      32768      256        36.81 us       0.14 us       890.24 MBytes/s
    256      65536      256        37.16 us       0.15 us      1763.43 MBytes/s
    512     131072      256        39.36 us       0.15 us      3329.83 MBytes/s
   1024     262144      256        41.34 us       0.16 us      6340.40 MBytes/s
   2048     524288      256        54.75 us       0.21 us      9576.21 MBytes/s
   4096     524288      128        51.83 us       0.40 us      10116.51 MBytes/s
   8192     524288       64        50.46 us       0.79 us      10390.38 MBytes/s
  16384     524288       32        49.69 us       1.55 us      10551.60 MBytes/s
  32768     524288       16        49.30 us       3.08 us      10634.86 MBytes/s
  65536     524288        8        49.07 us       6.13 us      10684.71 MBytes/s
 131072     524288        4        48.90 us      12.23 us      10721.20 MBytes/s
 262144     524288        2        48.89 us      24.44 us      10724.27 MBytes/s
 524288     524288        1        48.98 us      48.98 us      10704.78 MBytes/s
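The dma_bench columns are related as follows: total size = message size × vector length, latency per message = transfer time / vector length, and bandwidth ≈ total size / transfer time. A quick consistency check over a few rows of the table above:

```python
# Consistency check for the dma_bench columns shown above.
rows = [  # (message size, total size, vector length, time in us, MBytes/s)
    (64, 16384, 256, 35.76, 458.18),
    (8192, 524288, 64, 50.46, 10390.38),
    (524288, 524288, 1, 48.98, 10704.78),
]
for msg, total, vec, t_us, mbps in rows:
    assert msg * vec == total                      # total = message * vector
    latency = t_us / vec                           # latency per message (us)
    assert abs(total / t_us - mbps) / mbps < 0.01  # bandwidth = total / time
print("dma_bench table is self-consistent")
```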
	      
scipp

The scipp SISCI benchmark sends a message of the specified size to the remote system. The remote system polls for incoming data and sends a similar message back to the first node.

The minimal round-trip latency for writing to remote memory is extremely low using PCI Express networks.

The following results are typical for a PCI Express Gen3 x8 link:

Ping Pong data transfer:
  size       retries  latency (usec)  latency/2 (usec)
     0          2486           1.079             0.539
     4          2406           1.078             0.539
     8          2442           1.090             0.545
    16          2454           1.098             0.549
    32          2482           1.117             0.558
    64          2562           1.151             0.575
   128          2608           1.176             0.588
   256          2667           1.247             0.624
   512          2866           1.331             0.666
  1024          3064           1.492             0.746
  2048          3773           1.880             0.940
  4096          4850           2.659             1.330
  8192          7364           4.247             2.123
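The latency/2 column is simply half the measured ping-pong round-trip latency, i.e. the one-way remote write latency. A quick check against a few rows of the table above (the tolerance covers rounding to three decimals):

```python
# Consistency check: one-way latency is half the round-trip latency.
rows = [  # (message size, round-trip latency us, reported one-way us)
    (0, 1.079, 0.539),
    (1024, 1.492, 0.746),
    (8192, 4.247, 2.123),
]
for size, rtt, one_way in rows:
    assert abs(rtt / 2 - one_way) < 0.001, (size, rtt, one_way)
print("scipp latency columns are consistent")
```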
	      
intr_bench

The interrupt latency is affected by the operating system and can therefore vary.

Average unidirectional interrupt time :        2.515 us.
Average round trip     interrupt time :        5.030 us.