Advanced features

1 Choose your Debian version
2 Vnode’s filesystem
3 Network specifics
- 3.1 Choice of the network mode
  - 3.1.1 Classical mode
  - 3.1.2 VXLAN mode
- 3.2 Root interface of a vnetwork
4 Complex latencies definition
5 Mapping virtual nodes using Alevin
6 Memory limitation
7 IO throttling
8 Cgroups v1 / v2

This page describes how to use some advanced Distem features. Read it only once you are familiar with Distem and have completed the tutorial.

1 Choose your Debian version

By default, Distem is configured to run Debian Stretch. Since it also works with Buster, you can choose the Debian version by using a dedicated flag of distem-bootstrap:

$> distem-bootstrap -f $OAR_NODEFILE --debian-version buster

2 Vnode’s filesystem

Depending on your requirements, vnodes can use a distinct or a shared filesystem. With a distinct filesystem, the vnode’s rootfs is duplicated once for each vnode hosted on a given pnode. This can be an issue with a slow hard drive, since deploying a large number of vnodes can then take a lot of time.

If you do not care about having a private filesystem for each of the vnodes, you can share it among the vnodes belonging to the same pnode. In that case, only one copy of the rootfs is required during the deployment of the vnodes (at least if the vnodes use the same rootfs).

Finally, if you need a distinct filesystem and to deploy a lot of vnodes, you can use a COW-based filesystem. This is supposed to be the best trade-off between performance and isolation, but it is an experimental feature.

2.1 Distinct filesystem

To deploy vnodes with a distinct filesystem, you just have to set the shared property to false in the vfilesystem description when you create a vnode.

Here is an example:

result = cl.vnode_create(node, { 'vfilesystem' =>{'image' => img, 'shared' => false},
                                 'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet'}] })

2.2 Shared filesystem

To deploy vnodes with a shared filesystem, you just have to set the shared property to true in the vfilesystem description when you create a vnode.

Here is an example:

result = cl.vnode_create(node, { 'vfilesystem' =>{'image' => img, 'shared' => true},
                                 'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet'}] })

Note: using this property only impacts the vnodes belonging to the same pnode. Thus if you need to share the same filesystem across all the vnodes of a vplatform, you will have to deploy your own network filesystem (like NFS for instance).

2.3 COW filesystem

This feature is based on the Btrfs filesystem.

To use it, you first have to run distem-bootstrap with an additional flag that asks it to format the /tmp directory (the directory where Distem stores the vnodes’ rootfs) using Btrfs. To do that, you can run:

$> distem-bootstrap -f $OAR_NODEFILE --btrfs-format /dev/sda5

Then, in the creation of a vnode, you have to set the cow property to true, and the shared property to false. Here is an example:

result = cl.vnode_create(node, { 'vfilesystem' =>{'image' => img, 'shared' => false, 'cow' => true},
                                 'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet'}] })
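
The three variants above differ only in the 'shared' and 'cow' properties. As a minimal sketch, the choice can be wrapped in a small helper; vfilesystem_for and the image path are hypothetical names introduced for illustration and are not part of Distem:

```ruby
# Hypothetical helper (not part of Distem): returns the 'vfilesystem' hash
# matching one of the three modes described in this section.
def vfilesystem_for(image, mode)
  case mode
  when :distinct then { 'image' => image, 'shared' => false }
  when :shared   then { 'image' => image, 'shared' => true }
  when :cow      then { 'image' => image, 'shared' => false, 'cow' => true }
  else raise ArgumentError, "unknown filesystem mode: #{mode.inspect}"
  end
end

# Placeholder image path, to be adapted to your own rootfs.
img = '/path/to/rootfs.tar.gz'
fs = vfilesystem_for(img, :cow)
puts fs.inspect
```

The resulting hash can then be passed as the 'vfilesystem' value of vnode_create, exactly as in the examples above.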

3 Network specifics

3.1 Choice of the network mode

Distem offers two network modes for inter-pnode communications. The first one, called “classical mode”, leverages a classical Ethernet NIC and does not use packet encapsulation. The second one, called “VXLAN mode”, leverages VXLAN, which encapsulates Ethernet frames into UDP packets. In addition to allowing the use of adapters that do not support L2 network communication (like IPoIB devices for instance), it also isolates traffic, as a VLAN does. “VXLAN mode” also relieves the switches of the infrastructure from learning the MAC addresses of the vnodes (which is of great interest when deploying a large number of vnodes). However, packet encapsulation induces a small overhead.

3.1.1 Classical mode

This mode is the default mode in Distem. So you do not have to do anything to use it. However you can specify it by using a parameter (network_type) when creating a vnetwork, for instance:

result = cl.vnetwork_create('vnet1', '10.144.0.0/18', {'network_type' => 'classical'})

3.1.2 VXLAN mode

Here is an example of “VXLAN mode” usage:

result = cl.vnetwork_create('vnet2', '10.144.64.0/18', {'network_type' => 'vxlan'})

3.2 Root interface of a vnetwork

By default, Distem attaches a vnetwork to the default network interface. If you want to use another interface, such as a low-latency network interface, you can specify it with the root_interface option. For instance, to use ib0:

result = cl.vnetwork_create('vnet2', '10.144.128.0/18', {'network_type' => 'vxlan', 'root_interface' => 'ib0'})
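
The examples in this section use consecutive /18 ranges (10.144.0.0/18, 10.144.64.0/18 and 10.144.128.0/18). The following sketch, which relies only on Ruby's standard ipaddr library, shows one way such ranges can be computed. The enclosing 10.144.0.0/16 range and the split_subnets helper are assumptions made for illustration; use the range that was actually allocated to you.

```ruby
require 'ipaddr'
require 'socket'

# Hypothetical helper (not part of Distem): splits an IPv4 range into
# consecutive subnets of the given prefix length, returned as CIDR strings.
def split_subnets(base_cidr, new_prefix)
  range = IPAddr.new(base_cidr).to_range
  first = range.first.to_i
  last = range.last.to_i
  step = 2**(32 - new_prefix) # number of addresses in each subnet
  (first..last).step(step).map do |addr|
    "#{IPAddr.new(addr, Socket::AF_INET)}/#{new_prefix}"
  end
end

# Assumed enclosing range; adapt it to your reservation.
subnets = split_subnets('10.144.0.0/16', 18)
subnets.each { |subnet| puts subnet }
```

Each returned string can be used as the address argument of vnetwork_create.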

4 Complex latencies definition

In some cases, like P2P experiments for instance, you might need to set up a virtual platform where the latency between the vnodes is precisely defined. This could be achieved by using the classical latency definition on virtual interfaces, but it would be very complex in practice. It is therefore possible to define a latency between each pair of vnodes. This is achieved by defining a matrix M with n rows and n columns, where n is the number of vnodes. Each cell M_i,j is the latency between vnode_i and vnode_j, where i and j are the positions of the vnodes in a list containing all the vnodes.

Here is an example of defining such latencies between vnodes:

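require 'distem'

# Names of the 50 vnodes; the position of a name in this list is its row/column index in the matrix
nodes = (1..50).map { |i| "node#{i}" }
# 50x50 matrix of random latencies, each between 10 and 29
matrix = (1..50).map { (1..50).map { 10 + rand(20) } }
Distem.client do |cl|
  cl.set_peers_latencies(nodes, matrix)
end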

Note: This operation has to be performed once all the vnodes of the platform have been started. Furthermore, it is not compatible with the other latency definitions.
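
A fully random matrix gives a different latency in each direction and a non-zero latency from a vnode to itself. If your experiment assumes the same latency in both directions, the matrix can be built as in the following sketch. The symmetric_latency_matrix helper is hypothetical, and whether Distem expects a symmetric matrix or a zero diagonal is not stated here, so treat this as an illustration only:

```ruby
# Hypothetical helper (not part of Distem): builds an n x n latency matrix
# that is symmetric (M[i][j] == M[j][i]) with zeros on the diagonal.
def symmetric_latency_matrix(n, min, max, rng = Random.new)
  matrix = Array.new(n) { Array.new(n, 0) }
  (0...n).each do |i|
    ((i + 1)...n).each do |j|
      latency = rng.rand(min..max) # same value for both directions
      matrix[i][j] = latency
      matrix[j][i] = latency
    end
  end
  matrix
end

nodes = (1..5).map { |i| "node#{i}" }
# A seeded generator makes the matrix reproducible between runs.
matrix = symmetric_latency_matrix(nodes.size, 10, 29, Random.new(42))
matrix.each { |row| puts row.join(' ') }
```

The nodes list and the matrix can then be passed to set_peers_latencies as in the example above.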

5 Mapping virtual nodes using Alevin

We can use Alevin to map vnodes onto the physical infrastructure under bandwidth and CPU constraints.

5.1 Platform deployment

In this example, we will use 3 nodes. If you do not have a reservation yet, you can have a look at Make a reservation. Then, we deploy an environment using Kadeploy:


    frontend> kadeploy3 -f $OAR_NODE_FILE -k -e debian9-x64-big

5.2 Deploying Distem with support for Alevin

We have to use the latest revision of Distem, so let’s clone the repository somewhere:


     frontend> mkdir repositories
     frontend> cd repositories
     frontend> git clone https://github.com/madynes/distem.git
     frontend> cd distem

We also have to download Alevin. A compiled version is already provided by Distem; it has been modified to read DOT files (thanks to hardik.soni@inria.fr):


     frontend> wget https://gforge.inria.fr/frs/download.php/file/35944/alevin-ext.jar

Once you have cloned the repository and moved into it, deploy Distem on the nodes using distem-bootstrap. Make sure to activate the support for Alevin by passing the --alevin parameter with the path to the Alevin jar that you have just downloaded.


     frontend> scripts/distem-bootstrap -g --ci $PWD --alevin $PATH_TO_ALEVIN -p default-jdk,graphviz

We need to generate a physical topology. For that purpose, we have created a small Python script that uses Execo to get the physical topology of machines in Grid’5000. This script can be retrieved from the Distem forge. You can use it like this (adapt the example to your specific machines):


     frontend> wget https://gforge.inria.fr/frs/download.php/file/35980/get_physical_topo.py
     frontend> ~/get_physical_topo.py FILE_TOPO.dot $OAR_NODEFILE

Then, transfer the generated file to the coordinator:

     frontend> scp FILE_TOPO.dot root@COORDINATOR:~/

5.3 Creating a virtual platform with CPU and bandwidth constraints

Log into the coordinator and create the following file (called test_alevin.rb in the rest of this section):


require 'distem'
require 'yaml'

IMAGE_FILE ="/home/cruizsanabria/jessie-mpich-lxc.tar.gz" # TO BE MODIFIED
NETWORK = "10.144.0.0/22"

vnode_topo = YAML.load(File.read(ARGV[0]))
physical_topo = ARGV[1]

Distem.client do |cl|

  puts 'Creating virtual network'

  cl.vnetwork_create("testnet",NETWORK)

  puts 'Creating containers'

  private_key = IO.readlines('/root/.ssh/id_rsa').join
  public_key = IO.readlines('/root/.ssh/id_rsa.pub').join

  ssh_keys = {'private' => private_key,'public' => public_key}

  vnode_topo.each do |vnode|

    res = cl.vnode_create(vnode["name"],{
                                         'vfilesystem' =>{'image' => IMAGE_FILE,'shared' => true},
                                         'vifaces' => [{'name' => 'if0', 'vnetwork' => "testnet",
                                                        'output' =>{"bandwidth" =>{"rate" => vnode["bandwidth"]} }}]
                                        }, ssh_keys)

    if vnode["cpu"] > 0
      cl.vcpu_create(vnode["name"], 1, 'ratio', vnode["cpu"])
    end

  end

  puts 'Starting containers'

#  cl.vnodes_to_dot("vnodes.dot") #uncomment this line for generating a dot file with the topology of the virtual platform
  cl.load_physical_topo(physical_topo)
  cl.run_alevin()

  vnodes_list = vnode_topo.map{ |vnode| vnode["name"]}
  cl.vnodes_start(vnodes_list)

  puts 'Waiting for containers to be accessible'
  start_time = Time.now

  cl.wait_vnodes()
  puts "Initialization of containers took #{(Time.now-start_time).to_f}"

end

This script creates several vnodes with different CPU and bandwidth constraints. Two things are new here: we load the physical topology Distem is running on using the method load_physical_topo(file.dot), and we run Alevin using the method run_alevin(). Additionally, you have to create another file to specify the virtual nodes to create. This file has to be written in YAML and has a simple format that looks like this:


- name: node0
  bandwidth: "3000mbps"
  cpu: 1
- name: node1
  bandwidth: "1000mbps"
  cpu: 2
- name: node2
  bandwidth: "3000mbps"
  cpu: 1
- name: node3
  bandwidth: "1000mbps"
  cpu: 3
- name: node4
  bandwidth: "4000mbps"
  cpu: 2
- name: node5
  bandwidth: "1000mbps"
  cpu: 1

This way of creating the virtual platform is not in any way specific to Alevin. You can also use Alevin with the previous examples: just make sure to call the two methods load_physical_topo(file.dot) and run_alevin() before the vnodes_start() method.
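
Since the script reads the 'name', 'bandwidth' and 'cpu' keys of every entry and compares 'cpu' with 0, a malformed description file only fails once the script reaches the faulty entry. The following sketch checks the file beforehand. The vnode_description_errors helper and the rules it enforces are assumptions derived from how the script above uses the file, not a format specification from Distem:

```ruby
require 'yaml'

# Hypothetical helper (not part of Distem): returns a list of problems found
# in a vnode description such as the YAML file shown above.
def vnode_description_errors(vnodes)
  return ['top-level element must be a list'] unless vnodes.is_a?(Array)

  errors = []
  names = []
  vnodes.each_with_index do |vnode, idx|
    unless vnode.is_a?(Hash)
      errors << "entry #{idx}: not a mapping"
      next
    end
    # The script reads these three keys for every vnode.
    %w[name bandwidth cpu].each do |key|
      errors << "entry #{idx}: missing '#{key}'" unless vnode.key?(key)
    end
    # The script evaluates vnode["cpu"] > 0, so the value must be numeric.
    if vnode.key?('cpu') && !vnode['cpu'].is_a?(Numeric)
      errors << "entry #{idx}: 'cpu' must be a number"
    end
    names << vnode['name'] if vnode.key?('name')
  end
  duplicates = names.select { |name| names.count(name) > 1 }.uniq
  errors << "duplicate names: #{duplicates.join(', ')}" unless duplicates.empty?
  errors
end

# Deliberately broken sample: duplicated name and missing 'cpu' key.
description = YAML.safe_load(<<~YAML_DOC)
  - name: node0
    bandwidth: "3000mbps"
    cpu: 1
  - name: node0
    bandwidth: "1000mbps"
YAML_DOC

vnode_description_errors(description).each { |error| puts error }
```

In practice, the description would be loaded from the file given on the command line, and the script would stop before creating any vnode if the returned list is not empty.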

5.4 Deploying the virtual platform

Finally, you run the aforementioned script like this:

     coord> ruby test_alevin.rb vnodes_list.yml topo_grisou.dot

This will run Alevin in order to find the proper mapping and then Distem will deploy the virtual nodes accordingly. You should have an output that looks like this:


     root@grisou-20:~# ruby test_alevin.rb vnodes_list.yml topo_grisou.dot
     Creating virtual network
     Creating containers
     Starting containers
     Waiting for containers to be accessible
     Initialization of containers took 10.026698105

If you create many virtual nodes with different constraints that cannot be fulfilled, Distem will exit with an error that looks like this:


    root@grisou-20:~# ruby test_alevin.rb vnodes_list_big.yml topo_grisou.dot
    Creating virtual network
    Creating containers
    Starting containers
    /usr/lib/ruby/vendor_ruby/distem/netapi/client.rb:744:in `check_error': HTTP Status: 500, (Distem::Lib::ClientError)
    Description: "",
    Body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
    <HTML>
      <HEAD><TITLE>Internal Server Error</TITLE></HEAD>
        <BODY>
            <H1>Internal Server Error</H1>
            Alevin could not map all the vnodes, aborting ...
            <HR>
            <ADDRESS>
                 WEBrick/1.3.1 (Ruby/2.1.5/2014-11-13) at
                 localhost:4567
                </ADDRESS>
    </BODY>
       </HTML>

        from /usr/lib/ruby/vendor_ruby/distem/netapi/client.rb:799:in `block (2 levels) in raw_request'
        from /usr/lib/ruby/vendor_ruby/restclient/request.rb:228:in `call'
        from /usr/lib/ruby/vendor_ruby/restclient/request.rb:228:in `process_result'
        from /usr/lib/ruby/vendor_ruby/restclient/request.rb:178:in `block in transmit'
        from /usr/lib/ruby/2.1.0/net/http.rb:853:in `start'
        from /usr/lib/ruby/vendor_ruby/restclient/request.rb:172:in `transmit'
        from /usr/lib/ruby/vendor_ruby/restclient/request.rb:64:in `execute'
        from /usr/lib/ruby/vendor_ruby/restclient/request.rb:33:in `execute'
        from /usr/lib/ruby/vendor_ruby/restclient/resource.rb:67:in `post'
        from /usr/lib/ruby/vendor_ruby/distem/netapi/client.rb:798:in `block in raw_request'

We have to quit Distem in order to clean its state. This can be achieved by typing:


     coord> distem -q

You then have to launch Distem again using distem-bootstrap (from the frontend):


     frontend> scripts/distem-bootstrap -g --ci $PWD --alevin $PATH_TO_ALEVIN -p default-jdk,graphviz

You can omit the parameter -p given that the packages have already been installed previously.

6 Memory limitation

By default, the memory of an LXC container, and therefore of a vnode, is not limited as it would be in a typical virtual machine. Distem can nevertheless limit the memory of a vnode by leveraging a cgroup feature. Memory limitation is available for cgroups v1 and v2; the version must be specified with the ‘hierarchy’ key (see section 8 for a note on cgroups v1 versus v2). Advice: use v2 if available.

Cgroup v1:

With cgroup v1, Distem provides the 'mem' and 'swap' keys. 'mem' sets a hard limit on memory (processes exceeding it are killed by the OOM killer) and 'swap' sets a limit on the swap used by the vnode. Values are expressed in megabytes.

result = cl.vnode_create('n1',
  {
    'vfilesystem' => {'image' => img, 'shared' => true},
    'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet'}],
    'vmem' => {'hierarchy' => 'v1', 'mem' => 4096, 'swap' => 1024}
  })

Cgroup v2:

With cgroup v2, Distem provides the 'hard_limit', 'soft_limit' and 'swap' keys. The soft_limit key sets a limit that does not kill any process in the vnode but that may be exceeded under certain conditions. A soft and a hard limit can be used at the same time.

result = cl.vnode_create('n2',
  {
    'vfilesystem' => {'image' => img, 'shared' => true},
    'vifaces' => [{'name' => 'if0', 'vnetwork' => 'vnet'}],
    'vmem' => {'hierarchy' => 'v2', 'soft_limit' => 4096, 'hard_limit' => 4150, 'swap' => 1024}
  })

Distem saves the hierarchy for later use and thus you can update the memory limitation while the vnode is running using vnode_update or vmem_update:

result = cl.vmem_update('n2', {'soft_limit' => 'max'}) #cancel the soft limit
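
Because the accepted keys differ between the two hierarchies, it is easy to mix them up (for instance by passing 'mem' with 'v2'). The following sketch builds the 'vmem' hash and rejects keys that do not belong to the chosen hierarchy. The vmem_for helper and the VMEM_KEYS table are hypothetical; they only encode the key names listed in this section, and how Distem itself reacts to an unexpected key is not covered here:

```ruby
# Keys accepted for each hierarchy, as listed in this section.
VMEM_KEYS = {
  'v1' => %w[mem swap],
  'v2' => %w[hard_limit soft_limit swap]
}.freeze

# Hypothetical helper (not part of Distem): builds the 'vmem' hash and
# raises if a key does not belong to the chosen hierarchy.
def vmem_for(hierarchy, limits)
  allowed = VMEM_KEYS.fetch(hierarchy) do
    raise ArgumentError, "unknown hierarchy: #{hierarchy.inspect}"
  end
  unknown = limits.keys - allowed
  unless unknown.empty?
    raise ArgumentError, "keys not valid for #{hierarchy}: #{unknown.join(', ')}"
  end
  { 'hierarchy' => hierarchy }.merge(limits)
end

vmem = vmem_for('v2', { 'soft_limit' => 4096, 'hard_limit' => 4150, 'swap' => 1024 })
puts vmem.inspect
```

The returned hash can be used as the 'vmem' value of vnode_create, as in the examples above.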

7 IO throttling

Distem can perform IO throttling to limit the read and/or write performance of vnodes. It leverages cgroup v1 or v2 features. Advice: use v2 if available. You can specify IO throttling when creating a vnode and update it during your experiment.

To achieve this, the ‘disk_throttling/limits’ key takes a list of devices, each associated with the throttling parameters to apply in the vnode for that disk. This does not work for a partition (e.g. /dev/sda1). However, a partition can be limited if an additional virtualization layer (e.g. KVM) is active on top of it. Each specified device must be accessible by the container associated with the vnode (i.e. a mknod operation has been performed on that device); ‘/dev/sda’ should be accessible by default. IO throttling should work with most filesystems (ext4, btrfs, ...).

result = cl.vnode_create('n2',
  {
    'vfilesystem' => {'image' => img,
      'disk_throttling' => {'hierarchy' => 'v2', 'limits' =>
        [{'device' => '/dev/sda', 'read_limit' => 1024, 'write_limit' => 2048}]}
    },
  })

cl.vfilesystem_update('n2', 'disk_throttling' => {'limits' =>
        [{'device' => '/dev/sda', 'read_limit' => 'max', 'write_limit' => 'max'}]})
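
Since throttling does not work on partitions, it can help to reject them before calling Distem. The following sketch builds one entry of the 'limits' list and refuses device names that look like partitions. The throttling_limit helper and the PARTITION_PATTERN regular expression are hypothetical; the pattern is only a naming heuristic (assumption) covering /dev/sda1-style, /dev/nvme0n1p1-style and /dev/mmcblk0p1-style names:

```ruby
# Heuristic (assumption): device names of these forms are partitions.
PARTITION_PATTERN = %r{\A/dev/(?:[sv]d[a-z]+\d+|nvme\d+n\d+p\d+|mmcblk\d+p\d+)\z}

# Hypothetical helper (not part of Distem): builds one entry of the
# 'limits' list and raises if the device looks like a partition.
def throttling_limit(device, read_limit:, write_limit:)
  if device.match?(PARTITION_PATTERN)
    raise ArgumentError, "#{device} looks like a partition, use the whole disk"
  end
  { 'device' => device, 'read_limit' => read_limit, 'write_limit' => write_limit }
end

limits = [throttling_limit('/dev/sda', read_limit: 1024, write_limit: 2048)]
disk_throttling = { 'hierarchy' => 'v2', 'limits' => limits }
puts disk_throttling.inspect
```

The resulting hash can be used as the 'disk_throttling' value in the examples above.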

8 Cgroups v1 / v2

Cgroups v2 needs a recent kernel version, systemd >= 238 and LXC >= 3.0, as it requires the unified hierarchy to be set up by the system. In any case, using v2 controllers requires either passing cgroup_no_v1=c1,c2 in the kernel parameters (where c1,c2 are the controllers to use on the v2 hierarchy) or using a custom systemd configuration. For example, to use memory v2, cgroup_no_v1=memory is required. To use both the memory and the I/O controllers on a v2 hierarchy, cgroup_no_v1=memory,blkio is required. Swap limitation requires swapaccount=1 for both v1 and v2. These parameters can be specified in the environment description of a Kadeploy environment (kernel_params option of the boot section).
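
As a minimal sketch, the following code checks a kernel command line against the requirements above and reports what is missing. The missing_kernel_params helper is hypothetical and the command line is a sample string; on a pnode, the actual command line can be read from /proc/cmdline:

```ruby
# Hypothetical helper (not part of Distem): returns the kernel parameters
# that still have to be added to use the given controllers on a v2 hierarchy.
def missing_kernel_params(cmdline, v2_controllers, swap_limit: false)
  params = cmdline.split
  # Controllers already excluded from v1 through cgroup_no_v1=...
  disabled = params.grep(/\Acgroup_no_v1=/).flat_map do |param|
    param.split('=', 2).last.split(',')
  end
  missing = v2_controllers - disabled
  problems = []
  problems << "cgroup_no_v1=#{missing.join(',')}" unless missing.empty?
  problems << 'swapaccount=1' if swap_limit && !params.include?('swapaccount=1')
  problems
end

# Sample command line (assumption), standing in for File.read('/proc/cmdline').
cmdline = 'quiet cgroup_no_v1=memory'
missing_kernel_params(cmdline, %w[memory blkio], swap_limit: true).each do |param|
  puts "missing: #{param}"
end
```

With this sample, the memory controller is already excluded from v1, so only the blkio exclusion and swapaccount=1 are reported.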