Tutorial
Advanced Experiments
1 HPC experiment: HPCC benchmark
The goal of this experiment is to run the HPCC benchmark on 16 virtual nodes and to evaluate the impact of using slower nodes in different configurations. This tutorial is split into 3 steps for two different methods (shell and scripted):
- requirements;
- platform setup;
- experiment.
1.1 Requirements
In this experiment, we use 4 physical nodes of the same cluster. In this example, we will use the Graphene cluster since each physical node has 4 cores. However, any cluster with at least 4 nodes can do the job. You can reserve the nodes for two hours as follows:
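(Hedged sketch, Grid'5000 OAR syntax: the slash_22 resource requests a /22 subnet for the virtual network, and the exact resource syntax may vary by site.)
frontend> oarsub -I -t deploy -l slash_22=1+cluster=1/nodes=4,walltime=2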
For the tutorial, we assume that the reserved nodes are graphene-1, graphene-2, graphene-3 and graphene-4.
You can check the reserved network addresses as specified in Make a reservation. Deployment of the physical nodes works the same way as in Preparing physical machines. Finally, you should install Distem on the physical machines as specified in Distem Installation.
In the example, we assume that the coordinator is graphene-1 and the virtual network obtained is 10.144.0.0/22.
1.2 Simple shell experiment
1.2.1 Platform setup
We decided to set up our platform the shell way, but it could also be done with a script (as we did in Platform setup).
To set up the platform, we start by connecting to the coordinator node as the root user:
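frontend> ssh root@graphene-1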
First of all, we must create a virtual network with the virtual network address obtained earlier:
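(Assuming the distem CLI's --create-vnetwork flag; the network is named vnetwork, as referenced below.)
coord> distem --create-vnetwork vnetwork=vnetwork,address=10.144.0.0/22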
Then, we must create the virtual nodes. We will create 4 nodes on each physical node, called node-1 to node-16.
coord> export FS_IMG="file:///home/USER/distem_img/distem-fs-jessie.tar.gz"
coord> for i in `seq 1 4`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-1,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
coord> for i in `seq 5 8`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-2,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
coord> for i in `seq 9 12`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-3,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
coord> for i in `seq 13 16`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-4,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
Now we create the network interfaces on each virtual node:
coord> for i in `seq 1 16`; do \
distem --create-viface vnode=node-${i},iface=if0,vnetwork=vnetwork; \
done
Next, we create the virtual processors on the virtual nodes (here we define 1 core per virtual node that runs at full speed):
coord> for i in `seq 1 16`; do \
distem --set-vcpu vnode=node-${i},corenb=1,cpu_speed=unlimited; \
done
To check that everything went well, you can display the information about the configured virtual nodes:
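(Assuming the distem CLI exposes a --get-vnode-info flag.)
coord> distem --get-vnode-info vnode=node-1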
Finally, we start the virtual nodes:
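(A sketch, assuming the distem CLI's --start-vnode flag.)
coord> for i in `seq 1 16`; do \
distem --start-vnode vnode=node-${i}; \
done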
1.2.2 Experiment
We assume that the IP for the virtual node node-X is 10.144.0.X.
Let's connect to node-1 (distem --shell node-1) and add the following lines to ~/.ssh/config:
Host *
StrictHostKeyChecking no
HashKnownHosts no
Now, you can perform the following commands:
node-1> for i in `seq 1 16`; do echo 10.144.0.$i >> iplist; done
node-1> time mpiexec -machinefile iplist hpcc
This will launch the HPCC benchmark over the virtual nodes. You can observe the global execution time of the benchmark and compare it to runs where you choose a different virtual CPU frequency. You can also have a look at the generated hpccoutf.txt file, which contains the detailed results of each sub-benchmark.
You can now run the test with other frequencies by first updating them this way:
coord> for i in `seq 1 16`; do \
distem --config-vcpu vnode=node-${i},cpu_speed=0.5,unit=ratio; \
done
Note that you don't have to restart the virtual nodes for this update to take effect; it is applied on the fly.
Now relaunch your experiment by first connecting to node-1 (distem --shell node-1) and then running the same command as before:
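node-1> time mpiexec -machinefile iplist hpcc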
1.3 Scripted experiment
1.3.1 Platform setup
The first parameter of platform_setup.rb is the virtual network address allocated to your reservation and the second parameter is the coefficient applied to the CPU frequency (for instance, a coefficient of 0.5 means that the virtual cores will work at half of their real speed).
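For instance, assuming the 10.144.0.0/22 virtual network obtained earlier and full-speed cores:
coord> ruby platform_setup.rb 10.144.0.0/22 1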
Please note that this time we will start the virtual nodes in asynchronous mode to save some time, since there are a lot of virtual nodes to start and starting is the longest operation.
Here is the source of this script:
#!/usr/bin/ruby
# Import the Distem module
require 'distem'
# The path to the compressed filesystem image
# We can point to a local file since our home directory is available over NFS
FSIMG="file:///home/USER/distem_img/distem-fs-jessie.tar.gz"
# Put the physical machines that have been assigned to you
# You can get that by executing: cat $OAR_NODE_FILE | uniq
pnodes=["pnode1","pnode2", ... ]
# The first argument of the script is the address (in CIDR format)
# of the virtual network to set-up in our platform
vnet = {
'name' => 'testnet',
'address' => ARGV[0]
}
# The second argument of the script is the coefficient
# applied to the CPU frequency of the physical machines
cpu_limit = ARGV[1].to_f
nodelist = []
# Connect to the Distem server (on http://localhost:4567 by default)
Distem.client do |cl|
puts 'Creating virtual network'
# Start by creating the virtual network
cl.vnetwork_create(vnet['name'], vnet['address'])
puts 'Creating virtual nodes'
count = 0
# Read SSH keys
private_key = IO.readlines('/root/.ssh/id_rsa').join
public_key = IO.readlines('/root/.ssh/id_rsa.pub').join
sshkeys = {
'private' => private_key,
'public' => public_key
}
# Iterate over every physical node
pnodes.each do |pnode|
# Create 4 virtual nodes per physical machine (one per core)
4.times do
nodename = "node-#{count}"
# Create a virtual node and set it to be hosted on 'pnode'
cl.vnode_create(nodename, { 'host' => pnode }, sshkeys)
# Specify the path to the compressed filesystem image
# of this virtual node
cl.vfilesystem_create(nodename, { 'image' => FSIMG })
# Create a virtual CPU with 1 core on this virtual node
# specifying that its frequency should be 'cpu_limit'
cl.vcpu_create(nodename, cpu_limit, 'ratio', 1)
# Create a virtual network interface and connect it to vnet
cl.viface_create(nodename, 'if0', { 'vnetwork' => vnet['name'], 'default' => 'true' })
nodelist << nodename
count += 1
end
end
puts 'Starting virtual nodes ...'
nodelist.each do |nodename|
cl.vnode_start(nodename)
end
puts 'done'
end
1.3.2 Experiment
The experiment can be launched as root on the coordinator node with the following command:
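coord> ruby experiment.rb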
Here is the code of experiment.rb:
#!/usr/bin/ruby
require 'distem'
# Function that computes the average
# of an array of values
def average(values)
sum = values.inject(0){ |tmpsum,v| tmpsum + v.to_f }
return sum / values.size
end
# Function that computes the standard deviation
# of an array of values
def stddev(values,avg = nil)
avg = average(values) unless avg
sum = values.inject(0){ |tmpsum,v| tmpsum + ((v.to_f-avg) ** 2) }
return Math.sqrt(sum / values.size)
end
# Describing the resources we are working with
ifname = 'if0'
# The virtual nodes list
nodelist = []
16.times do |count|
nodelist << "node-#{count}"
end
iplist = []
results = []
iterations = 5
Distem.client do |cl|
# Get the automatically assigned address of each virtual node's
# virtual network interface
nodelist.each do |nodename|
iplist << cl.viface_info(nodename,ifname)['address'].split('/')[0]
end
# Creating a string with each ip on a single line
ipliststr = iplist.join("\n")
# Copying the iplist in a file on the first node
cl.vnode_execute(nodelist[0], "echo '#{ipliststr}' >> iplist")
puts 'Starting tests'
iterations.times do |iter|
puts "\tIteration #{iter}"
start_time = Time.now
cl.vnode_execute(nodelist[0], 'mpiexec -machinefile iplist hpcc')
results << Time.now - start_time
end
end
avg = average(results)
puts "Results: [average=#{avg},standard_deviation=#{stddev(results,avg)}]"
As in the non-scripted version, to perform experiments with several CPU speeds, you can update the CPU speed using the vcpu_update method.
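A minimal sketch, assuming vcpu_update takes the same value and unit arguments as vcpu_create:
#!/usr/bin/ruby
require 'distem'
Distem.client do |cl|
  16.times do |count|
    # Halve the speed of every virtual CPU (assumed vcpu_update signature)
    cl.vcpu_update("node-#{count}", 0.5, 'ratio')
  end
end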
2 Large scale experiment
The goal of this experiment is to run a large scale experiment on 1000 virtual nodes. This will illustrate some Distem features that help to create a large virtual platform in a short time.
First, we reserve 10 physical nodes for two hours as follows:
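(Hedged sketch, Grid'5000 OAR syntax; adapt the resource syntax to your site.)
frontend> oarsub -I -t deploy -l slash_22=1+cluster=1/nodes=10,walltime=2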
We are going to create a script, platform_setup.rb, to set up the platform:
#!/usr/bin/ruby
require 'distem'
require 'thread'
net,netmask = ARGV[0].split('/')
nb_vnodes = ARGV[1].to_i
img = "file:///home/USER/distem_img/distem-fs-jessie.tar.gz""
nodes = []
iplist = []
Distem.client { |cl|
cl.vnetwork_create('vnet', "#{net}/#{netmask}")
(1..nb_vnodes).each { |i| nodes << "node#{i}" }
res = cl.vnodes_create(nodes,
{
'vfilesystem' =>{'image' => img,'shared' => true},
'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet', 'default' => 'true'}]
})
# Not used further, but could be useful in such a script
res.each { |r| iplist << r['vifaces'][0]['address'].split('/')[0] }
puts "Starting vnodes..."
cl.vnodes_start(nodes)
sleep(30)
puts "Waiting for vnodes to be here..."
if cl.wait_vnodes({'timeout' => 600, 'port' => 22})
puts "Setting global /etc/hosts"
cl.set_global_etchosts()
puts "Setting global ARP tables"
cl.set_global_arptable()
else
puts "vnodes are unreachable"
exit 1
end
}
You can notice that this script uses vnodes_create() and vnodes_start() instead of the vnode_create() and vnode_start() functions. These are the vectorized versions of the previous functions: they avoid a lot of HTTP requests and thus drastically speed up platform creation when dealing with hundreds or thousands of nodes.
This script also calls two functions that may help with your experiment:
- set_global_etchosts() fills the /etc/hosts file of every virtual node so that the virtual nodes can be addressed directly by name instead of by IP. Indeed, when using several physical nodes, or a non-shared filesystem, the /etc/hosts files are not filled globally.
- set_global_arptable() fills the ARP table of every virtual node with the MAC addresses of all the virtual nodes in the platform. This is useful for large scale experiments since it avoids a lot of ARP requests that could lead to connection failures.
Then, you can deploy the platform with:
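(Hedged sketch: we assume Distem was installed via distem-bootstrap, which takes the --max-vifaces option discussed below, and that the virtual network is 10.144.0.0/22.)
frontend> distem-bootstrap --max-vifaces 150
coord> ruby platform_setup.rb 10.144.0.0/22 1000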
Note that --max-vifaces is used to specify the maximum number of virtual interfaces that can be created on a physical node (64 by default). As we asked to deploy 1000 virtual nodes on 10 physical nodes, 100 virtual interfaces will be created per physical node on average, so we set the parameter to 150 to accommodate an uneven distribution of the virtual nodes.
Finally, it is up to you to run a large scale experiment :)
3 Fault injection experiment
Distem provides users with an event manager to automatically modify the virtual platform in a deterministic way. Supported modifications are:
- modification of network interfaces capabilities (bandwidth and latency)
- modification of the CPU frequency
- start and stop virtual nodes
- freeze and unfreeze virtual nodes
Events can be specified in two ways. First, it is possible to use an event trace that specifies which modification occurs at which date (relative to the start of the experiment). Second, it is possible to draw the dates of event arrivals automatically from a probability distribution. Currently, uniform, exponential and Weibull distributions are supported.
The goal of this experiment is to evaluate a fault-tolerant file broadcast tool called kascade. In particular, we will evaluate its behavior when introducing node failures.
First, we reserve 10 physical nodes for two hours as follows:
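(Hedged sketch, as in the previous experiment.)
frontend> oarsub -I -t deploy -l slash_22=1+cluster=1/nodes=10,walltime=2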
You can check the reserved network addresses as specified in Make a reservation.
Deployment of the physical nodes works the same way as in Preparing physical machines.
We create a script, platform_setup.rb, to set up the platform:
#!/usr/bin/ruby
require 'distem'
require 'thread'
net,netmask = ARGV[0].split('/')
nb_vnodes = ARGV[1].to_i
img = "file:///home/USER/distem_img/distem-fs-jessie.tar.gz""
nodes = []
iplist = []
Distem.client { |cl|
cl.vnetwork_create('vnet', "#{net}/#{netmask}")
(1..nb_vnodes).each { |i| nodes << "node#{i}" }
res = cl.vnodes_create(nodes,
{
'vfilesystem' =>{'image' => img,'shared' => true},
'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet', 'default' => 'true'}]
})
# Not used further, but could be useful in such a script
res.each { |r| iplist << r['vifaces'][0]['address'].split('/')[0] }
puts "Starting vnodes..."
cl.vnodes_start(nodes)
sleep(30)
puts "Waiting for vnodes to be here..."
if cl.wait_vnodes({'timeout' => 600, 'port' => 22})
puts "Setting global /etc/hosts"
cl.set_global_etchosts()
puts "Setting global ARP tables"
cl.set_global_arptable()
else
puts "vnodes are unreachable"
exit 1
end
}
Then, you can deploy a 50-virtual-node platform with:
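(Illustrative, reusing the /22 virtual network from the reservation.)
coord> ruby platform_setup.rb 10.144.0.0/22 50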
The experiment can be launched from the coordinator node with the following command:
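(The only argument is the path to the Kascade binary; the path below is a placeholder.)
coord> ruby experiment.rb /home/USER/kascade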
Here is the code of experiment.rb:
#!/usr/bin/ruby
require 'pp'
require 'distem'
require 'tempfile'
require 'rubygems'
require 'net/ssh'
NBVNODES = 50
REPS = 3
PATH_TO_KASCADE = ARGV[0]
nodes = (1..NBVNODES).collect {|i| "node#{i}"}
# 3 experiments are launched here:
# - run without failure
# - run with simultaneous failures of 5% of the nodes
# - run with sequential failures of 5% of the nodes
EXP = [ { :name => 'no_failure', :trace => nil },
{ :name => 'simult_5percent', :trace => [ [10, [4,14,24,34,44]] ] },
{ :name => 'seq_5percent', :trace => [ [10, [4]], [14, [14]], [18, [24]],
[22, [34]], [26, [44]] ] } ]
results = {}
# Create a node file for Kascade
f = Tempfile.new('kascade_ft')
nodes.drop(1).each { |node| f.puts(node) }
f.close
# Copy the node file into the first virtual node
system("scp #{f.path} root@node1:nodes")
# Copy Kascade into the first virtual node
system("scp #{PATH_TO_KASCADE} root@node1:kascade")
# Add execution rights to kascade
system("ssh root@node1 'chmod +x ~/kascade'")
# Generate a 500MB file
system("ssh root@node1 'dd if=/dev/zero of=/tmp/file bs=1M count=500'")
Distem.client { |cl|
EXP.each { |experiment|
results[experiment[:name]] = []
# Run the experiment several times
REPS.times { |iter|
puts "### Experiment #{experiment[:name]}, iteration #{iter}"
nodes_down = []
trace = experiment[:trace]
# Check if events have to be injected
if trace
trace.each { |dates|
date,node_numbers = dates
nodes_down += node_numbers.collect { |number| "node#{number}" }
node_numbers.each { |number|
cl.event_trace_add({ 'vnodename' => "node#{number}", 'type' => 'vnode' },
'churn',
{ date => 'down' })
}
}
cl.event_manager_start
end
# Perform a run
Net::SSH.start('node1', 'root', :password => 'root') {|ssh|
start = Time.now.to_f
ssh.exec('/root/kascade -n /root/nodes -i /tmp/file -o /dev/null -D taktuk -v fatal')
ssh.loop
results[experiment[:name]] << Time.now.to_f - start
}
# Clean
if trace
cl.event_manager_stop
puts "Let's restart #{nodes_down.join(',')}"
cl.vnodes_start(nodes_down)
ret = cl.wait_vnodes({'timeout' => 120, 'port' => 22, 'vnodes' => nodes_down})
if not ret
puts "Some nodes are unreachable"
exit 1
end
end
}
}
}
pp results
In this experiment, the failure consists of stopping some virtual nodes at a given time. Other strategies could have been used: for instance, we could have simulated a network issue where the network interfaces of some virtual nodes become unresponsive (without being completely shut down). This could have been achieved by setting a high latency on those network interfaces. The following code:
cl.event_trace_add({ 'vnodename' => "node#{number}", 'type' => 'vnode' },
'churn',
{ date => 'down' })
could have been replaced for instance with:
cl.event_trace_add({ 'vnodename' => "node#{number}",
'type' => 'viface',
'vifacename' => 'if0',
'viface_direction' => 'output' },
'latency',
{ date => '200000ms' })
In this case, we have added a latency of 200 s on some network interfaces (in the uplink direction), leading to almost unresponsive nodes.