Debian Cluster Components
- on Intel's ia64 architecture -
Table of Contents
- Motivation
- Setup of the master node
- Create the client image
- LDAP configuration
- Torque configuration
- Install clients
- To do
1. Motivation
Debian GNU/Linux
contains all packages needed to manage a compute cluster system, so
there is no reason why Debian should not be the Linux distribution
for such a system. In principle, you
can configure all required cluster parts manually, e.g.,
the software imaging or the queueing system,
but this is where the
Debian Cluster Components (DCC) will help you.
DCC is a collection of scripts that depend on all required cluster
software packages and configure a complete working compute
cluster based on Debian.
But if you want to use Debian and DCC on Intel's Itanium2 architecture
(ia64), you will run into some problems:
- there are no pre-compiled packages for the queueing system Torque,
- the boot process has to deal with Intel's special EFI firmware,
- the OpenLDAP server does not work on ia64 under Debian Sarge.
2. Setup of the master node
This section describes how to configure the basic Debian system
and how to create the packages that are not available
for ia64. Additionally, all preparations for the client installation
are explained.
2.1. Basic system installation
The Debian installation procedure is the same as on other architectures, but
during the partitioning step of the installation you have to
create an EFI boot partition.
It is recommended that this partition is the first one on your first hard disk.
As mentioned in the
DCC installation instructions, the partition
containing /var/ should be big enough to store the client image(s).
After the basic installation you have to configure your network in /etc/network/interfaces:
# external network interface
auto eth0
iface eth0 inet static
address x.x.x.x
netmask y.y.y.y
network z.z.z.z
broadcast u.u.u.u
gateway v.v.v.v
# internal network interface
auto eth1
iface eth1 inet static
address 192.168.0.1
netmask 255.255.255.0
network 192.168.0.0
broadcast 192.168.0.255
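The network and broadcast values in the stanzas above follow from the address and the netmask. A minimal sketch of that arithmetic in plain shell is given here; net_and_bcast is a hypothetical helper name chosen for this illustration and is not part of Debian or DCC.

```shell
#!/bin/sh
# Derive the "network" and "broadcast" lines for /etc/network/interfaces
# from an IPv4 address and a netmask, both in dotted-quad notation.
# net_and_bcast is a hypothetical helper name, not a Debian or DCC tool.
net_and_bcast() {
    addr=$1
    mask=$2
    old_ifs=$IFS
    IFS=.
    # Split both values into octets: $1..$4 = address, $5..$8 = netmask.
    set -- $addr $mask
    IFS=$old_ifs
    # network = address AND mask
    echo "network $(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
    # broadcast = network OR (inverted mask); 255 - octet inverts one octet
    echo "broadcast $(( ($1 & $5) | (255 - $5) )).$(( ($2 & $6) | (255 - $6) )).$(( ($3 & $7) | (255 - $7) )).$(( ($4 & $8) | (255 - $8) ))"
}
```

Called as net_and_bcast 192.168.0.1 255.255.255.0, the function prints the two lines used for eth1 above.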
Furthermore, you have to modify your host table
/etc/hosts for the internal cluster network:
127.0.0.1 localhost.localdomain localhost
# internal definition
192.168.0.1 master.localdomain master
# optional external definition
xxx.xxx.xxx.xxx cluster.your-external.domain cluster
On the i386 and amd64 architectures you can install the Debian Cluster Components without any problems because all binary packages are available and working. On ia64, however, there are no Torque binaries. You have to build them yourself and install them before you install the DCC packages. In this context some packages have to be installed manually. This tutorial tells you which packages have to be installed at which point.
To conclude the basic installation, a word about sharing file systems between the nodes. It is recommended that users log in on the master node only and start their jobs there. In the configuration presented here, the master node has a directory /master/ that is shared with all working nodes and mounted there as /master/ as well. For performance reasons each working node should have a local directory /scratch/ that serves as the working directory for the users' jobs. A job script is executed on a working node and could copy the binary file into /scratch/, execute it there, and then copy the resulting data files back to /master/.
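The staging pattern described above (copy to /scratch/, run, copy back to /master/) can be sketched as a small shell function for use in a job script. This is only an illustration under stated assumptions: run_staged and its three arguments are hypothetical names, and the function assumes that the program writes its results into its current working directory.

```shell
#!/bin/sh
# Sketch of the staging pattern: copy the program from the shared directory
# to node-local scratch space, run it there, and copy the results back.
# run_staged and its arguments are hypothetical names for this illustration.
run_staged() {
    shared_dir=$1      # e.g. a job directory below /master (shared via NFS)
    scratch_root=$2    # e.g. /scratch (local disk of the working node)
    program=$3         # executable located in $shared_dir

    # Private working directory so that parallel jobs do not collide.
    work_dir=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1

    cp "$shared_dir/$program" "$work_dir/" || return 1

    # Run the program inside the scratch directory and remember its status.
    status=0
    ( cd "$work_dir" && "./$program" ) || status=$?

    # Copy everything except the program itself back to the shared directory,
    # then clean up the scratch space.
    rm -f "$work_dir/$program"
    cp -r "$work_dir"/. "$shared_dir"/
    rm -rf "$work_dir"
    return $status
}
```

A job script could then end with a call such as run_staged /master/USER/myjob /scratch myprog, where the directory and program names are placeholders.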
2.2. Compile the queueing system Torque
It is possible to use the Debian source packages of Torque
provided by the DCC project.
Add the following lines to your /etc/apt/sources.list
deb http://ftp.irb.hr/pub/irb/dcc/ ./
deb-src http://ftp.irb.hr/pub/irb/dcc/ ./
Change into root's home directory or into /tmp/ and call
apt-get build-dep torque
apt-get source torque
cd torque*
dpkg-buildpackage
Now, all required Torque packages for the master and the client nodes have been compiled for ia64, and the Debian packages have been created in the parent directory. On the master node you have to install the following packages
aptitude install libcurses-perl
dpkg -i torque-common_1.0.1p6-4_all.deb \
torque-server_1.0.1p6-4_ia64.deb \
torque-sched_1.0.1p6-4_ia64.deb \
torque-utils_1.0.1p6-4_ia64.deb
2.3. Recompile tftpd-dcc and configure the network boot ftp daemon
A TFTP boot server is required for the tftpd-dcc package. It is recommended to install
atftpd instead of tftpd because the two daemons treat the
directory /tftpboot/ in different ways.
Install atftpd and some other required packages with
aptitude install atftpd systemimager-boot-ia64-standard \
systemimager-server elilo
You probably have to configure the start of the atftpd daemon by inetd in /etc/inetd.conf by adding the line
tftp dgram udp wait nobody /usr/sbin/tcpd /usr/sbin/in.tftpd \
--tftpd-timeout 300 --retry-timeout 5 --mcast-port 1758 \
--mcast-addr 239.255.0.0-255 --mcast-ttl 1 --maxthread 100 \
--verbose=5 /tftpboot
Now, restart the inetd daemon or send a HUP signal to it.
Debian uses the elilo boot loader to start on ia64 systems. It now has to be set up using the tftpd-dcc package. tftpd-dcc depends on the package syslinux, but this package is not available on the ia64 architecture. You have two options:
- create a virtual dummy package named syslinux,
- remove the dependency on syslinux from the package tftpd-dcc.
For the second option, fetch the package sources with
aptitude install debconf-dcc
apt-get build-dep tftpd-dcc
apt-get source tftpd-dcc
cd tftpd-dcc*
Now, the dependency line in the file debian/control is changed to
Depends: atftpd, systemimager-boot-ia64-standard, systemimager-server, elilo
Additionally, the files postinst and prerm have to be modified. These new scripts can be downloaded here. Now, this package can be compiled by
dpkg-buildpackage
and installed with
dpkg -i tftpd-dcc
After the installation of tftpd-dcc the directory /tftpboot/ is prepared for the network boot installation of the clients. More details can be found in the SystemImager FAQs. The master node is now ready to become the "real" master by installing the package dcc-front
aptitude install dcc-front
3. Create the client image
On ia64 systems you have to adapt the
client image as explained in the following sections.
3.1. Image creation configuration files
The required configuration files are located in /etc/dcc/.
The file config contains the name of the
image, the hostname prefix for the clients, and the start IP address
for the clients. The file disktable defines the
partitioning of the clients. If you use the gpt
hard disk label instead of the msdos label, it should
be possible to define more than four partitions, but this
does not work.
The first partition on the first hard drive has to be the
EFI boot partition. This partition has to have the filesystem type
vfat and has to be mounted on
/boot/efi/. The mount flags are
defaults and bootable.
A usable disktable file could look like
label_type=gpt
/dev/sda1 200 vfat /boot/efi defaults bootable
/dev/sda2 6144 swap
/dev/sda3 6144 xfs /tmp defaults
/dev/sda4 * xfs / defaults
192.168.0.1:/master - nfs /master rw
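Since a wrong first partition leaves the clients unbootable, it can be worth checking the disktable before building the image. The following is a minimal sketch of such a check. check_efi_partition is a hypothetical helper, not part of DCC or SystemImager, and the field order (device, size, filesystem type, mount point, options, flags) is an assumption read off the example above.

```shell
#!/bin/sh
# Check that the first partition entry of a disktable file is a bootable
# vfat partition mounted on /boot/efi, as required on ia64.
# check_efi_partition is a hypothetical helper, not part of DCC.
# Assumed field order: device size fstype mountpoint options flags
check_efi_partition() {
    awk '
        $1 ~ /^\/dev\// {
            # Only the first /dev/ entry is inspected, then we stop reading.
            ok = ($3 == "vfat" && $4 == "/boot/efi" && $6 == "bootable")
            exit
        }
        END { exit(ok ? 0 : 1) }
    ' "$1"
}
```

The function returns success only if the check passes, so it can be used as check_efi_partition /etc/dcc/disktable || echo "first partition is not a bootable EFI partition".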
Now, you have to modify the package file packages.list. The client kernel image should be defined in this file, but this does not work. You have to install the kernel image manually after the image creation. The package file has to contain the following lines
elilo
systemconfigurator
libcurses-perl
tk8.3
discover
You can add any further packages that you would like to have available on the clients.
Finally, the source list in sources.list can be adapted to the Debian mirrors you prefer. This file could look like
deboot http://ftp.debian.org/debian sarge
deb ftp://ftp.fu-berlin.de/pub/unix/linux/mirrors/debian/ stable main
deb-src ftp://ftp.fu-berlin.de/pub/unix/linux/mirrors/debian/ stable main
deb http://security.debian.org/ stable/updates main
deb http://ftp.irb.hr/pub/irb/dcc/ ./
deb-src http://ftp.irb.hr/pub/irb/dcc/ ./
3.2. dcc_buildimage
After the preparations of the previous section, the DCC build script
can be called with
dcc_buildimage
3.3. Image post package installation and configuration
If the image was created successfully, you have to edit the image
and install the remaining packages that are not available for
ia64. In particular, the boot loader has to be configured.
Copy the Torque binary packages into the image's /tmp/ directory
cp torque*.deb /var/lib/systemimager/images/IMAGE_NAME/tmp/
Now, change into the image with
dcc_editimage IMAGE_NAME
Inside the image call the following commands
cd /tmp/
dpkg -i torque-common_1.0.1p6-4_all.deb \
torque-mom_1.0.1p6-4_ia64.deb \
torque-utils_1.0.1p6-4_ia64.deb
rm torque*
aptitude upgrade
aptitude install dcc-node
aptitude install kernel-image-VERSION-mckinley-smp
It is recommended to configure the mail system of the clients properly. For example, you could configure exim so that no local mail is delivered and all system mails to root are sent to a smarthost.
Not all hardware of the clients may be supported automatically after the installation, for example the USB keyboard. In principle, this is not necessary because you only log in to the clients remotely. Nevertheless, if you want a working USB keyboard, you can edit the file /etc/modules and list the required modules there.
You can leave the image chroot environment with "exit".
3.4. Image boot loader configuration
The boot loader requires the file elilo.efi
and the kernel in use in a special directory. Call the following commands
inside the image
cd /boot/
mkdir efi
cd efi/
cp /usr/lib/elilo/elilo.efi .
cp ../vmlinuz* vmlinuz
Now, a very important modification has to be made to the boot loader configuration Perl module of the systemconfig package. Edit the file /usr/lib/systemconfig/Boot.pm inside the image and comment out all boot modules except Boot::EFI.
Leave the image and call
mksidisk -A --name node --file /etc/dcc/disktable
mkautoinstallscript --image node --force --ip-assignment dhcp --post-install reboot
If the internal network interface of the clients is not eth0, you additionally have to adapt the image install script in /var/lib/systemimager/scripts/. Change eth0 to ethX in the [INTERFACE] section.
After this, you have to re-enter the image and edit the file /etc/systemconfig/systemconfig.conf. Normally, you only have to modify the PATH line in the kernel section. vmlinuz has to be in the root directory.
# systemconfig.conf written by systeminstaller.
CONFIGBOOT = YES
CONFIGRD = YES
[BOOT]
ROOTDEV = /dev/sda4
BOOTDEV = /dev/sda
DEFAULTBOOT = vmlinuz
[KERNEL0]
PATH = /vmlinuz
LABEL = vmlinuz
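If the PATH line has to be changed in more than one image, the edit can be scripted. The following is a minimal sketch under the assumption that the file has the simple section layout shown above. set_kernel_path is a hypothetical helper name, and it prints the modified file to standard output instead of editing it in place.

```shell
#!/bin/sh
# Print a copy of a systemconfig.conf file in which the PATH line of the
# [KERNEL0] section is replaced by the given kernel path.
# set_kernel_path is a hypothetical helper, not part of systemconfigurator.
set_kernel_path() {
    awk -v path="$2" '
        # Track whether the current section is [KERNEL0].
        /^\[/ { in_kernel = ($0 == "[KERNEL0]") }
        # Replace the PATH line only inside that section.
        in_kernel && /^[ \t]*PATH[ \t]*=/ { print "PATH = " path; next }
        # All other lines are passed through unchanged.
        { print }
    ' "$1"
}
```

Inside the image, a call such as set_kernel_path /etc/systemconfig/systemconfig.conf /vmlinuz > /tmp/systemconfig.conf.new produces a candidate file that can be reviewed before it replaces the original.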
4. LDAP configuration
Because the OpenLDAP server does not
work under Debian Sarge on ia64, the automatic LDAP configuration
of the DCC project
does not work either. One simple solution
is to use an existing LDAP server in your network. You
need to adjust the LDAP configuration on the
master node and inside the image.
/etc/ldap/ldap.conf
BASE dc=your,dc=domain
URI ldaps://your.ldap.server
TLS_CACERT /path/to/your/cacert.pem
/etc/libnss-ldap.conf
host your.ldap.server
base dc=your,dc=domain
uri ldaps://your.ldap.server
ldap_version 3
timelimit 30
bind_timelimit 30
pam_filter objectclass=posixAccount
pam_password md5
nss_base_passwd ou=tree_in_ldap_where_user_accounts_are,dc=your,dc=domain
nss_base_group ou=tree_in_ldap_where_groups_are,dc=your,dc=domain
/etc/pam_ldap.conf
host your.ldap.server
base dc=your,dc=domain
uri ldaps://your.ldap.server
ldap_version 3
timelimit 30
bind_timelimit 30
pam_filter objectclass=posixAccount
pam_password md5
The PAM configuration and nsswitch.conf are already set up properly by the DCC installation.
5. Torque configuration
Last but not least, some small adaptations of the Torque system
have to be made.
The Torque mom daemon's configuration file
/etc/torque/mom_config inside the image
requires the $usecp parameter
$clienthost master.localdomain
$restricted master.localdomain
$logevent 0x1ff
$usecp cluster.your.domain:/master /master
$usecp cluster:/master /master
$usecp node1.localdomain:/master /master
$usecp node1:/master /master
$usecp node2.localdomain:/master /master
$usecp node2:/master /master
# two entries for each work node
# (one with domain and one without)
On the master node (outside the image) you may have to modify the node properties (e.g., the number of CPUs of each node) in the file /var/spool/torque/server_priv/nodes (see the Torque documentation).
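The per-node $usecp entries and the matching lines of the nodes file are repetitive, so for larger clusters they can be generated. The following is a minimal sketch: usecp_lines and nodes_lines are hypothetical helper names, the node1, node2, ... naming follows the example above, and the same CPU count on every node is an assumption of this illustration.

```shell
#!/bin/sh
# Print the two $usecp lines (with and without domain) for each work node.
# usecp_lines is a hypothetical helper; node names follow the pattern
# node1, node2, ... used in the example configuration.
usecp_lines() {
    count=$1     # number of work nodes
    domain=$2    # e.g. localdomain
    dir=$3       # shared directory, e.g. /master
    i=1
    while [ "$i" -le "$count" ]; do
        printf '$usecp node%s.%s:%s %s\n' "$i" "$domain" "$dir" "$dir"
        printf '$usecp node%s:%s %s\n' "$i" "$dir" "$dir"
        i=$((i + 1))
    done
}

# Print one "name np=N" line per work node for server_priv/nodes.
# nodes_lines is a hypothetical helper; it assumes that every node has
# the same number of CPUs.
nodes_lines() {
    count=$1     # number of work nodes
    cpus=$2      # CPUs per node
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'node%s np=%s\n' "$i" "$cpus"
        i=$((i + 1))
    done
}
```

The output of usecp_lines 2 localdomain /master reproduces the node1 and node2 entries shown above; the entries for the master's external name still have to be added by hand.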
6. Install clients
Now, you can proceed as explained in the
DCC installation documentation. Call
dcc_dicovernode
and boot the clients over the internal network device. The discovery is required only for the first start of a client. Once the client is stored in the SIS database, booting it over the network is sufficient for an installation, without any action on the master node.
7. To do
Currently, pushing image modifications to all clients does not
work reliably. If you call the image pushing command,
the boot loader configuration of the clients is destroyed.
First attempts to prevent this behavior led to a damaged network configuration
on the clients.
At the moment I am working on a solution to this problem.
Current workaround: reinstall the client if you have made changes
to the image.
Acknowledgment
Thanks a lot to Valentin Vidic, one of the authors of the
Debian Cluster Components Project. He spent a lot of time
answering my questions and gave many helpful hints for
finding a working configuration for the DCC on an ia64 cluster.