Enabling AppPot on SMSCG
AppPot is a way to implement generic application deployment on a computational Grid, and especially to enable users to provide their own software to the computing cluster.
On the SMSCG infrastructure, AppPot is used as the core virtualization mechanism.
This page describes how to install AppPot and enable a base AppPot RTE.
Basic installation
Note: the AppPot base image referenced in these notes is a 64-bit image; it will not run on 32-bit hosts. (Conversely, a 32-bit AppPot image can work on a 64-bit system; there are however some incompatibilities between the AppPot Linux kernel and the default Linux kernel shipped with CentOS5 that make this combination not deployable on SMSCG.)
Download and install the basic AppPot files
AppPot installation procedure:
- Make a directory for storing all the AppPot-related files. This directory needs to be accessible from all compute nodes in your cluster.
In the following examples, we shall assume that this directory is /share/grid/apppot, but of course the value depends on your local cluster configuration.
Note: We assume that you run all the following commands in this chosen directory!
- Download the AppPot starter script apppot-start.sh from the AppPot SVN repository:
wget http://apppot.googlecode.com/svn/trunk/apppot-start.sh
- Download the AppPot reference base image:
wget http://ocikbapps.uzh.ch/gc3wiki/download/apppot/apppot-0.26.disk.img
- Download the AppPot/UML Linux kernel 2.6.38 from http://uml.devloop.org.uk/kernels.html:
wget http://uml.devloop.org.uk/kernels/kernel64-2.6.38.8.bz2
bunzip2 kernel64-2.6.38.8.bz2
Note: The current kernel release may change from time to time. Do not pick the very latest kernel available from http://uml.devloop.org.uk/; stick to a 2.6.38 or 2.6.39 version, since these have been successfully tested. Ask on the SMSCG mailing list if unsure.
- Verification step: Now you should be able to run AppPot:
./apppot-start.sh --apppot apppot-0.26.disk.img --kernel kernel64-2.6.38.8 ls -l /home/user/job
The above command should print a long list of Linux boot messages, followed by the listing of the current directory, followed by a few shutdown messages.
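If the verification command fails before any boot message appears, a quick check of the downloaded files can help. The following is a minimal sketch, not part of AppPot: the function name check_apppot_files is made up for illustration, and the file names are those used in the steps above. Files fetched with wget are not executable by default, so the sketch also reports a missing executable bit on the starter script and on the kernel (that the kernel file itself must be executable is an assumption about how apppot-start.sh invokes it).

```shell
# Hypothetical helper: report AppPot files that are missing or unusable
# in the given directory. File names are the ones used in this page.
# Prints one line per problem; returns 0 only if no problem was found.
check_apppot_files() {
    dir=$1
    status=0
    for f in apppot-start.sh apppot-0.26.disk.img kernel64-2.6.38.8; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing: $dir/$f"
            status=1
        fi
    done
    # Assumption: both the starter script and the UML kernel are executed
    # directly, so they need the executable bit (wget does not set it).
    for f in apppot-start.sh kernel64-2.6.38.8; do
        if [ -e "$dir/$f" ] && [ ! -x "$dir/$f" ]; then
            echo "not executable: $dir/$f"
            status=1
        fi
    done
    return $status
}

# Usage:
#   check_apppot_files /share/grid/apppot
# A "not executable" line can be fixed with chmod +x on the named file.
```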
Download and install the auxiliary programs
In order to correctly run a job, AppPot requires two support programs, named slirp and empty.
You can install both programs in the same directory as the other AppPot files, or in any other place of your choosing. Just remember that the two binary executables slirp and empty have to be available in the system $PATH when apppot-start.sh is run.
Download and install slirp
On a Debian/Ubuntu system, you just need to run the command:
sudo apt-get install slirp
On other Linux systems, you will have to compile from source.
- Download the sources from the AppPot SVN repository:
svn co http://apppot.googlecode.com/svn/trunk/slirp-1.0.17 slirp-1.0.17
- Run the following commands in order to compile the binary slirp-fullbolt:
cd slirp-1.0.17
cd src
export CFLAGS="-DFULL_BOLT -O2 -I. -DUSE_PPP -DUSE_MS_DNS -fno-strict-aliasing -Wno-unused"
./configure CFLAGS="$CFLAGS"
make CFLAGS="$CFLAGS" PPPCFLAGS="$CFLAGS" clean all
- Now copy the src/slirp file to the AppPot directory (you are already in the src directory):
cp -a slirp /share/grid/apppot/slirp-fullbolt
Download and install empty
On a Debian/Ubuntu system, you just need to run the command:
sudo apt-get install empty-expect
On other Linux systems, you will have to compile from source. Download the sources for empty from http://empty.sourceforge.net/. Follow the instructions on the web page to compile and install it.
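Once both programs are installed, it may be worth confirming that they are found through $PATH on a compute node, since this is the condition stated above. The snippet below is a minimal sketch; missing_tools is a hypothetical helper name, and the check relies only on the standard command -v builtin.

```shell
# Hypothetical helper: print the names of the given programs that cannot
# be found in $PATH. Returns 0 only if every program was found.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return $status
}

# Usage, with the same PATH that the job will see:
#   PATH=$PATH:/share/grid/apppot missing_tools slirp empty
```

An empty output means both programs are reachable; any name printed still needs to be installed or added to $PATH.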
Configure the host system to manage AppPot/UML resources
Shared memory filesystem
UML uses file-backed mmap() calls to share memory among processes. Therefore, the available space on the filesystem where these files are created limits the total amount of memory available to AppPot/UML VMs.
As of version 2.6.38, the UML kernel examines the environment variables $TMP, $TEMP and $TMPDIR (if defined), then the directories /tmp and /dev/shm, in this order. The first one found to be writable and executable is used as the shared memory location.
So, in order to use a different shared memory directory, one needs to set the $TMP, $TEMP or $TMPDIR environment variable prior to starting an AppPot/UML VM instance, for example in the 1 ) section of the RTE script:
# scratch directory for mmap() sharing
export TMP=/state/partition1
It's better to use TMP or TEMP because TMPDIR is a standard variable name, used for many purposes on the system (for instance, SGE sets it to the job's spool directory).
Again: the size of the chosen scratch directory limits the total amount of memory that can be allocated by all UML instances. So, you want to choose a scratch filesystem that has *at least* as much free space as the node memory.
If the filesystem is not writable or files cannot be made executable when UML starts, it will fall back to the default /tmp or /dev/shm.
Note: This shared memory filesystem must be available on all the compute nodes, but it should be a local directory, not a shared one!
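To see which directory a given environment would lead to, and how much space it offers, the lookup order described above can be approximated in a few lines of shell. This is only a sketch of the documented order, not the kernel's actual code: pick_shm_dir and free_kb are hypothetical names, and the test does not detect a noexec mount option, which would also make a directory unusable for UML.

```shell
# Hypothetical helper: emulate the documented lookup order
# ($TMP, $TEMP, $TMPDIR, then /tmp and /dev/shm) and print the first
# candidate that is a directory the current user can write to and enter.
# Limitation: a "noexec" mount is not detected by this test.
pick_shm_dir() {
    for dir in "${TMP:-}" "${TEMP:-}" "${TMPDIR:-}" /tmp /dev/shm; do
        [ -n "$dir" ] || continue
        if [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: free space, in kB, on the filesystem holding $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Usage:
#   dir=$(pick_shm_dir) && echo "$dir: $(free_kb "$dir") kB free"
```

Comparing the reported free space with the MemTotal line of /proc/meminfo shows whether the chosen directory satisfies the "at least as much free space as the node memory" rule.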
Using the tmpfs filesystem
The default directory used by UML for process memory sharing is /dev/shm, which on Linux systems is a tmpfs filesystem. Therefore, the size of tmpfs must be sufficiently large to run a UML instance with a given --mem specification. Otherwise, the UML job may fail due to insufficient memory. To set the size of tmpfs, modify the entry in /etc/fstab:
tmpfs /dev/shm tmpfs defaults,size=... 0 0
with the new size, and then remount the filesystem:
# mount -o remount /dev/shm
It is recommended to set the size of the /dev/shm tmpfs filesystem to 95-98% of the total physical memory; tmpfs only uses its memory resources when needed.
Note: These changes must be executed on all the compute nodes!
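Since the change has to be repeated on every node, a small check helps spot nodes that were missed. The sketch below reports the size of a mounted filesystem as a percentage of physical memory; fs_size_pct_of_ram is a hypothetical helper name, and the optional second argument exists only so that the function can be tried against a copy of /proc/meminfo.

```shell
# Hypothetical helper: print the total size of the filesystem holding $1
# as an integer percentage of MemTotal.
# $2: meminfo file to read (defaults to /proc/meminfo)
fs_size_pct_of_ram() {
    size_kb=$(df -Pk "$1" | awk 'NR == 2 {print $2}')
    awk -v size="$size_kb" \
        '$1 == "MemTotal:" {print int(size * 100 / $2)}' \
        "${2:-/proc/meminfo}"
}

# Usage:
#   fs_size_pct_of_ram /dev/shm
# A value in the 95-98 range matches the recommendation above.
```

Note that the tmpfs size= option also accepts a percentage of physical RAM (for example size=95%), which avoids computing an absolute size per node.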
Max mmap size
Furthermore, memory within UML is mapped to host memory using mmap(), so UML installs a very large number of mappings (each covering as little as one 4 KB page). This requires raising the value of /proc/sys/vm/max_map_count, which is too low in standard Linux installations.
Make sure that the value of /proc/sys/vm/max_map_count is sufficiently large on all compute nodes!
To set /proc/sys/vm/max_map_count, use the command sysctl or put a line
vm.max_map_count=XXX
into the file /etc/sysctl.conf so that it will be re-applied at every reboot.
To check the value of 'max_map_count':
# sysctl -a | grep max_map_count
To avoid any memory limit imposed by this counter, set max_map_count to the total physical memory in kB divided by 4 (that is, one mapping per 4 KB page):
# sysctl vm.max_map_count=`awk '$1 == "MemTotal:" {print int($2 / 4)}' /proc/meminfo`
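The same computation can be used to produce the persistent /etc/sysctl.conf entry, so that the value survives a reboot. This is a minimal sketch: the two function names are made up for illustration, and the optional argument only allows reading a file other than /proc/meminfo.

```shell
# Hypothetical helper: MemTotal (in kB) divided by 4, i.e. one mapping
# per 4 KB page of physical memory.
# $1: meminfo file to read (defaults to /proc/meminfo)
recommended_max_map_count() {
    awk '$1 == "MemTotal:" {print int($2 / 4)}' "${1:-/proc/meminfo}"
}

# Hypothetical helper: print the line to put into /etc/sysctl.conf.
sysctl_conf_line() {
    echo "vm.max_map_count=$(recommended_max_map_count "$@")"
}

# Usage (as root, on each compute node):
#   sysctl_conf_line >> /etc/sysctl.conf
#   sysctl -p
```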
No memory enforcement by the batch system
Batch systems miscompute the amount of memory used by running AppPot/UML instances because they assume that much of the memory used by a process is private and only a negligible fraction is shared (the situation in UML is exactly the opposite: most of the memory is shared and only a small fraction is private).
This issue has been observed with Sun Grid Engine 6.2 and PBS/TORQUE; it's very likely to happen with other batch systems as well.
The key point here is: UML will ensure that AppPot does not use more system memory than what is specified with the --mem option (defaulting to 512M). Therefore, you can and should turn off all memory checks in the batch system when running AppPot/UML jobs!
ARC-specific setup
Create an RTE file ENV/APPOT-0.26 with the following contents (adapt the paths to your local installation):
#!/bin/bash
apppot_version='0.26'
case "$1" in
0 )
# ensure 'apppot-start.sh' is in PATH at submission time;
# change '/share/grid/apppot' with the path to the directory
# where you downloaded the AppPot files
export PATH=$PATH:/share/grid/apppot
export joboption_uml='yes'
;;
1 )
# prepare the environment (on the worker node)
#
# change '/share/grid/apppot' with the path to the directory
# where you downloaded the AppPot files
# do not overwrite APPPOT_IMAGE if it's already set by some other RTE
if [ -z "$APPPOT_IMAGE" ]; then
export APPPOT_IMAGE=/share/grid/apppot/apppot-0.26.disk.img
fi
export APPPOT_KERNEL=/share/grid/apppot/kernel64-2.6.38.8
export APPPOT_START=/share/grid/apppot/apppot-start.sh
export APPPOT_STARTUP="$APPPOT_START"
export PATH=$PATH:/share/grid/apppot
# set this to a local filesystem that AppPot/UML can use for
# shared memory files. The default is /dev/shm
export TMP=/dev/shm
;;
2 )
# no cleanup needs to be done
;;
* )
# This should never happen, but just exit with error code.
return 1
;;
esac
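Before submitting a test job, the RTE file can be exercised by hand to confirm that each stage sets what is expected. The following is a minimal sketch that assumes, as the case statement above does, that the stage number reaches the script as $1 when it is sourced; show_rte_env is a hypothetical helper name and the path in the usage line is a placeholder.

```shell
# Hypothetical helper: source an RTE script for one stage inside a
# subshell and print the APPPOT_* variables it exports.
# $1: path to the RTE script (use a path containing a slash)
# $2: stage number (0, 1 or 2)
show_rte_env() {
    rte=$1
    stage=$2
    (
        set -- "$stage"               # the stage becomes $1 in the script
        . "$rte" >/dev/null || exit 1 # non-zero if the script rejects it
        env | grep '^APPPOT_' | sort
    )
}

# Usage (placeholder path, adapt to where the RTE file lives):
#   show_rte_env ./ENV/APPOT-0.26 1
```

Running everything in a subshell keeps the exported variables out of the interactive session, so the stages can be tried repeatedly.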
