Tutorial
Advanced Experiments
1 HPC experiment: HPCC benchmark
The goal of this experiment is to run the HPCC benchmark on 16 virtual nodes and to evaluate the impact of using slower nodes in different configurations. This tutorial is split into 3 steps for two different methods (shell and scripted):
- requirements;
- platform setup;
- experiment.
1.1 Requirements
In this experiment, we use 4 physical nodes of the same cluster. In this example, we will use the Graphene cluster since each physical node has 4 cores. However, any cluster with at least 4 nodes can do the job. You can reserve the nodes for two hours as follows:
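(Hedged sketch, Grid'5000 OAR syntax: the slash_22 resource requests a /22 subnet for the virtual network, and the exact resource syntax may vary by site.)
frontend> oarsub -I -t deploy -l slash_22=1+cluster=1/nodes=4,walltime=2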
For the tutorial, we assume that the reserved nodes are graphene-1, graphene-2, graphene-3 and graphene-4.
You can check the reserved network addresses as specified in Make a reservation. Deployment of the physical nodes works the same way as in Preparing physical machines. Finally, you should install Distem on the physical machines as specified in Distem Installation.
In the example, we assume that the coordinator is graphene-1 and the virtual network obtained is 10.144.0.0/22.
1.2 Simple shell experiment
1.2.1 Platform setup
We decided to set up our platform the shell way, but it could also be done with a script (as we did in Platform setup).
To set up the platform, we start by connecting to the coordinator node as the root user:
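frontend> ssh root@graphene-1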
First of all, we must create a virtual network with the virtual network address obtained earlier:
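(Assuming the distem CLI's --create-vnetwork flag; the network is named vnetwork, as referenced below.)
coord> distem --create-vnetwork vnetwork=vnetwork,address=10.144.0.0/22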
Then, we must create the virtual nodes. We will create 4 nodes on each physical node, called node-1 to node-16.
coord> export FS_IMG="file:///home/USER/distem_img/distem-fs-jessie.tar.gz"
coord> for i in `seq 1 4`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-1,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
coord> for i in `seq 5 8`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-2,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
coord> for i in `seq 9 12`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-3,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
coord> for i in `seq 13 16`; do \
distem --create-vnode vnode=node-${i},pnode=graphene-4,rootfs=${FS_IMG},\
sshprivkey=/root/.ssh/id_rsa,sshpubkey=/root/.ssh/id_rsa.pub; \
done
Now we create the network interfaces on each virtual node:
coord> for i in `seq 1 16`; do \
distem --create-viface vnode=node-${i},iface=if0,vnetwork=vnetwork; \
done
Next, we create the virtual processors on the virtual nodes (here we define 1 core per virtual node that runs at full speed):
coord> for i in `seq 1 16`; do \
distem --set-vcpu vnode=node-${i},corenb=1,cpu_speed=unlimited; \
done
To check that everything went well, you can display the information about the configured virtual nodes:
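(Assuming the distem CLI exposes a --get-vnode-info flag.)
coord> distem --get-vnode-info vnode=node-1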
Finally, we start the virtual nodes:
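(A sketch, assuming the distem CLI's --start-vnode flag.)
coord> for i in `seq 1 16`; do \
distem --start-vnode vnode=node-${i}; \
done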
1.2.2 Experiment
We assume that the IP for the virtual node node-X is 10.144.0.X.
Let's connect to node-1 (distem --shell node-1) and add the following lines to ~/.ssh/config:
Host *
StrictHostKeyChecking no
HashKnownHosts no
Now, you can perform the following commands:
node-1> for i in `seq 1 16`; do echo 10.144.0.$i >> iplist; done
node-1> time mpiexec -machinefile iplist hpcc
This will launch the HPCC benchmark over the virtual nodes. You can observe the global execution time of the benchmark and compare it to runs where you choose a different virtual CPU frequency. You can also have a look at the generated hpccoutf.txt file, which contains the detailed results of each sub-benchmark.
You can now run the test with other frequencies by first updating them this way:
coord> for i in `seq 1 16`; do \
distem --config-vcpu vnode=node-${i},cpu_speed=0.5,unit=ratio; \
done
Note that you don't have to restart the virtual nodes for this update to take effect; it is applied on the fly.
Now relaunch your experiment by first connecting to node-1 (distem --shell node-1) and then running the same command as before:
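node-1> time mpiexec -machinefile iplist hpcc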
1.3 Scripted experiment
1.3.1 Platform setup
The first parameter of platform_setup.rb is the virtual network address allocated to your reservation and the second parameter is the coefficient applied to the CPU frequency (for instance, a coefficient of 0.5 means that the virtual cores will work at half of their real speed).
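For instance, assuming the 10.144.0.0/22 virtual network obtained earlier and full-speed cores:
coord> ruby platform_setup.rb 10.144.0.0/22 1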
Please note that this time we will start the virtual nodes in asynchronous mode to save some time, since there are a lot of virtual nodes to start and starting is the longest operation.
Here is the source of this script:
#!/usr/bin/ruby
# Import the Distem module
require 'distem'
# The path to the compressed filesystem image
# We can point to a local file since our home directory is available over NFS
FSIMG="file:///home/USER/distem_img/distem-fs-jessie.tar.gz"
# Put the physical machines that have been assigned to you
# You can get that by executing: cat $OAR_NODE_FILE | uniq
pnodes=["pnode1","pnode2", ... ]
# The first argument of the script is the address (in CIDR format)
# of the virtual network to set-up in our platform
vnet = {
'name' => 'testnet',
'address' => ARGV[0]
}
# The second argument of the script is the coefficient
# applied to the CPU frequency of the physical machines
cpu_limit = ARGV[1].to_f
nodelist = []
# Connect to the Distem server (on http://localhost:4567 by default)
Distem.client do |cl|
puts 'Creating virtual network'
# Start by creating the virtual network
cl.vnetwork_create(vnet['name'], vnet['address'])
puts 'Creating virtual nodes'
count = 0
# Read SSH keys
private_key = IO.readlines('/root/.ssh/id_rsa').join
public_key = IO.readlines('/root/.ssh/id_rsa.pub').join
sshkeys = {
'private' => private_key,
'public' => public_key
}
# Iterate over every physical node
pnodes.each do |pnode|
# Create 4 virtual nodes per physical machine (one per core)
4.times do
nodename = "node-#{count}"
# Create a virtual node and set it to be hosted on 'pnode'
cl.vnode_create(nodename, { 'host' => pnode }, sshkeys)
# Specify the path to the compressed filesystem image
# of this virtual node
cl.vfilesystem_create(nodename, { 'image' => FSIMG })
# Create a virtual CPU with 1 core on this virtual node
# specifying that its frequency should be 'cpu_limit'
cl.vcpu_create(nodename, cpu_limit, 'ratio', 1)
# Create a virtual network interface and connect it to vnet
cl.viface_create(nodename, 'if0', { 'vnetwork' => vnet['name'], 'default' => 'true' })
nodelist << nodename
count += 1
end
end
puts 'Starting virtual nodes ...'
nodelist.each do |nodename|
cl.vnode_start(nodename)
end
puts 'done'
end
1.3.2 Experiment
The experiment can be launched as root on the coordinator node with the following command:
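coord> ruby experiment.rb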
Here is the code of experiment.rb:
#!/usr/bin/ruby
require 'distem'
# Function that computes the average
# of an array of values
def average(values)
sum = values.inject(0){ |tmpsum,v| tmpsum + v.to_f }
return sum / values.size
end
# Function that computes the standard deviation
# of an array of values
def stddev(values,avg = nil)
avg = average(values) unless avg
sum = values.inject(0){ |tmpsum,v| tmpsum + ((v.to_f-avg) ** 2) }
return Math.sqrt(sum / values.size)
end
# Describing the resources we are working with
ifname = 'if0'
# The virtual nodes list
nodelist = []
16.times do |count|
nodelist << "node-#{count}"
end
iplist = []
results = []
iterations = 5
Distem.client do |cl|
# Get the automatically assigned address of each virtual node's
# virtual network interface
nodelist.each do |nodename|
iplist << cl.viface_info(nodename,ifname)['address'].split('/')[0]
end
# Creating a string with each ip on a single line
ipliststr = iplist.join("\n")
# Copying the iplist in a file on the first node
cl.vnode_execute(nodelist[0], "echo '#{ipliststr}' >> iplist")
puts 'Starting tests'
iterations.times do |iter|
puts "\tIteration #{iter}"
start_time = Time.now
cl.vnode_execute(nodelist[0], 'mpiexec -machinefile iplist hpcc')
results << Time.now - start_time
end
end
avg = average(results)
puts "Results: [average=#{avg},standard_deviation=#{stddev(results,avg)}]"
As in the non-scripted version, to perform experiments with several CPU speeds, you can update the CPU speed using the vcpu_update method.
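A minimal sketch, assuming vcpu_update takes the same value and unit arguments as vcpu_create:
#!/usr/bin/ruby
require 'distem'
Distem.client do |cl|
  16.times do |count|
    # Halve the speed of every virtual CPU (assumed vcpu_update signature)
    cl.vcpu_update("node-#{count}", 0.5, 'ratio')
  end
end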
2 Large scale experiment
The goal of this experiment is to run a large scale experiment on 1000 virtual nodes. This will illustrate some Distem features that help to create a large virtual platform in a short time.
First, we reserve 10 physical nodes for two hours as follows:
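(Hedged sketch, Grid'5000 OAR syntax; adapt the resource syntax to your site.)
frontend> oarsub -I -t deploy -l slash_22=1+cluster=1/nodes=10,walltime=2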
We are going to create a script, platform_setup.rb, to set up the platform:
#!/usr/bin/ruby
require 'distem'
require 'thread'
net,netmask = ARGV[0].split('/')
nb_vnodes = ARGV[1].to_i
img = "file:///home/USER/distem_img/distem-fs-jessie.tar.gz""
nodes = []
iplist = []
Distem.client { |cl|
cl.vnetwork_create('vnet', "#{net}/#{netmask}")
(1..nb_vnodes).each { |i| nodes << "node#{i}" }
res = cl.vnodes_create(nodes,
{
'vfilesystem' =>{'image' => img,'shared' => true},
'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet', 'default' => 'true'}]
})
# Not used further, but could be useful in such a script
res.each { |r| iplist << r['vifaces'][0]['address'].split('/')[0] }
puts "Starting vnodes..."
cl.vnodes_start(nodes)
sleep(30)
puts "Waiting for vnodes to be here..."
if cl.wait_vnodes({'timeout' => 600, 'port' => 22})
puts "Setting global /etc/hosts"
cl.set_global_etchosts()
puts "Setting global ARP tables"
cl.set_global_arptable()
else
puts "vnodes are unreachable"
exit 1
end
}
You can notice that this script uses vnodes_create() and vnodes_start() instead of the vnode_create() and vnode_start() functions. These are the vectorized versions of the previous functions: they avoid a lot of HTTP requests and thus drastically speed up platform creation when dealing with hundreds or thousands of nodes.
This script also calls two functions that may help with your experiment:
- set_global_etchosts() fills the /etc/hosts file of every virtual node so that the virtual nodes can be addressed directly by name instead of by IP. Indeed, when using several physical nodes, or a non-shared filesystem, the /etc/hosts files are not filled globally.
- set_global_arptable() fills the ARP table of every virtual node with the MAC addresses of all the virtual nodes in the platform. This is useful for large scale experiments since it avoids a lot of ARP requests that could lead to connection failures.
Then, you can deploy the platform with:
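(Hedged sketch: we assume Distem was installed via distem-bootstrap, which takes the --max-vifaces option discussed below, and that the virtual network is 10.144.0.0/22.)
frontend> distem-bootstrap --max-vifaces 150
coord> ruby platform_setup.rb 10.144.0.0/22 1000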
Note that --max-vifaces is used to specify the maximum number of virtual interfaces that can be created on a physical node (64 by default). As we asked to deploy 1000 virtual nodes on 10 physical nodes, 100 virtual interfaces will be created per physical node on average, so we set the parameter to 150 to accommodate an uneven distribution of the virtual nodes.
Finally, it is up to you to run a large scale experiment :)
3 Fault injection experiment
Distem provides users with an event manager to automatically modify the virtual platform in a deterministic way. Supported modifications are:
- modification of network interfaces capabilities (bandwidth and latency)
- modification of the CPU frequency
- start and stop virtual nodes
- freeze and unfreeze virtual nodes
Events can be specified in two ways. First, it is possible to use an event trace that specifies which modification occurs at which date (relative to the start of the experiment). Second, it is possible to draw the dates of event arrivals automatically from a probability distribution. Currently, uniform, exponential and Weibull distributions are supported.
The goal of this experiment is to evaluate a fault-tolerant file broadcast tool called kascade. In particular, we will evaluate its behavior when introducing node failures.
First, we reserve 10 physical nodes for two hours as follows:
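(Hedged sketch, as in the previous experiment.)
frontend> oarsub -I -t deploy -l slash_22=1+cluster=1/nodes=10,walltime=2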
You can check the reserved network addresses as specified in Make a reservation.
Deployment of the physical nodes works the same way as in Preparing physical machines.
We create a script, platform_setup.rb, to set up the platform:
#!/usr/bin/ruby
require 'distem'
require 'thread'
net,netmask = ARGV[0].split('/')
nb_vnodes = ARGV[1].to_i
img = "file:///home/USER/distem_img/distem-fs-jessie.tar.gz""
nodes = []
iplist = []
Distem.client { |cl|
cl.vnetwork_create('vnet', "#{net}/#{netmask}")
(1..nb_vnodes).each { |i| nodes << "node#{i}" }
res = cl.vnodes_create(nodes,
{
'vfilesystem' =>{'image' => img,'shared' => true},
'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet', 'default' => 'true'}]
})
# Not used further, but could be useful in such a script
res.each { |r| iplist << r['vifaces'][0]['address'].split('/')[0] }
puts "Starting vnodes..."
cl.vnodes_start(nodes)
sleep(30)
puts "Waiting for vnodes to be here..."
if cl.wait_vnodes({'timeout' => 600, 'port' => 22})
puts "Setting global /etc/hosts"
cl.set_global_etchosts()
puts "Setting global ARP tables"
cl.set_global_arptable()
else
puts "vnodes are unreachable"
exit 1
end
}
Then, you can deploy a 50-virtual-node platform with:
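(Illustrative, reusing the /22 virtual network from the reservation.)
coord> ruby platform_setup.rb 10.144.0.0/22 50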
The experiment can be launched from the coordinator node with the following command:
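(The only argument is the path to the Kascade binary; the path below is a placeholder.)
coord> ruby experiment.rb /home/USER/kascade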
Here is the code of experiment.rb:
#!/usr/bin/ruby
require 'pp'
require 'distem'
require 'tempfile'
require 'rubygems'
require 'net/ssh'
NBVNODES = 50
REPS = 3
PATH_TO_KASCADE = ARGV[0]
nodes = (1..NBVNODES).collect {|i| "node#{i}"}
# 3 experiments are launched here:
# - run without failure
# - run with simultaneous failures of 5% of the nodes
# - run with sequential failures of 5% of the nodes
EXP = [ { :name => 'no_failure', :trace => nil },
{ :name => 'simult_5percent', :trace => [ [10, [4,14,24,34,44]] ] },
{ :name => 'seq_5percent', :trace => [ [10, [4]], [14, [14]], [18, [24]],
[22, [34]], [26, [44]] ] } ]
results = {}
# Create a node file for Kascade
f = Tempfile.new('kascade_ft')
nodes.drop(1).each { |node| f.puts(node) }
f.close
# Copy the node file into the first virtual node
system("scp #{f.path} root@node1:nodes")
# Copy Kascade into the first virtual node
system("scp #{PATH_TO_KASCADE} root@node1:kascade")
# Add execution rights to kascade
system("ssh root@node1 'chmod +x ~/kascade'")
# Generate a 500MB file
system("ssh root@node1 'dd if=/dev/zero of=/tmp/file bs=1M count=500'")
Distem.client { |cl|
EXP.each { |experiment|
results[experiment[:name]] = []
# Run the experiment several times
REPS.times { |iter|
puts "### Experiment #{experiment[:name]}, iteration #{iter}"
nodes_down = []
trace = experiment[:trace]
# Check if events have to be injected
if trace
trace.each { |dates|
date,node_numbers = dates
nodes_down += node_numbers.collect { |number| "node#{number}" }
node_numbers.each { |number|
cl.event_trace_add({ 'vnodename' => "node#{number}", 'type' => 'vnode' },
'churn',
{ date => 'down' })
}
}
cl.event_manager_start
end
# Perform a run
Net::SSH.start('node1', 'root', :password => 'root') {|ssh|
start = Time.now.to_f
ssh.exec('/root/kascade -n /root/nodes -i /tmp/file -o /dev/null -D taktuk -v fatal')
ssh.loop
results[experiment[:name]] << Time.now.to_f - start
}
# Clean
if trace
cl.event_manager_stop
puts "Let's restart #{nodes_down.join(',')}"
cl.vnodes_start(nodes_down)
ret = cl.wait_vnodes({'timeout' => 120, 'port' => 22, 'vnodes' => nodes_down})
if not ret
puts "Some nodes are unreachable"
exit 1
end
end
}
}
}
pp results
In this experiment, the failure consists of stopping some virtual nodes at a given time. Other strategies could have been used: for instance, we could have simulated a network issue where the network interfaces of some virtual nodes become unresponsive (without being completely shut down). This could have been achieved by setting a high latency on those network interfaces. The following code:
cl.event_trace_add({ 'vnodename' => "node#{number}", 'type' => 'vnode' },
'churn',
{ date => 'down' })
could have been replaced for instance with:
cl.event_trace_add({ 'vnodename' => "node#{number}",
'type' => 'viface',
'vifacename' => 'if0',
'viface_direction' => 'output' },
'latency',
{ date => '200000ms' })
In this case, we have added a latency of 200 s on some network interfaces (in the uplink direction), leading to almost unresponsive nodes.