This software (MPJ Express) is a reference implementation of the MPI bindings defined for the Java language. The current version follows the mpiJava 1.2 API specification. We plan to add support for the MPJ API in a subsequent release. The two APIs differ in the naming schemes for classes and methods; the functionality provided to users is essentially the same for both.
This release contains the source code and binaries of MPJ Express library, as well as the runtime infrastructure. We have developed a test suite that imports various test cases from mpiJava; it also has a number of new test cases. This test suite checks the functionality of almost every MPI function. See the section on "MPJ Express test suite" for further details. This software has been tested on various UNIX and Windows operating systems. See the section "Tested Platforms" for a list of tested platforms.
There are two fundamental ways of running MPJ Express applications. The first, and recommended, way is to use the MPJ Express runtime infrastructure. The second way involves the 'manual' start-up of MPJ Express processes.
The MPJ Express runtime infrastructure consists of daemons and the mpjrun module. Users of MPJ Express first start daemons on a number of compute-nodes, which in this document means the machines that execute MPJ Express processes. These can also be thought of as the compute-nodes of a cluster. Once the daemons are running, users run the mpjrun module (using the mpjrun.sh or mpjrun.bat scripts) on the cluster's head-node. The mpjrun module contacts the daemons, starts the MPJ Express application, and transports output back to the head-node so that users can view the progress of their programs during execution. The MPJ Express runtime infrastructure is able to run the code as JAR or class files. The runtime infrastructure provides the notion of local loaders and remote loaders. A user may prefer local loaders if the compute-nodes and head-node have a shared file system, and the MPJ Express JAR files as well as the user application are available locally on the compute-nodes. Remote loaders, on the other hand, can be used where there is no shared file system on the compute-nodes, and the MPJ Express JAR files and the user applications have to be fetched from the head-node.
The second way, which is referred to in this document as 'manual', is to run the shell script 'runmpj.sh', which uses SSH to execute the code. This script is able to run JAR or class files, but it can only be used on UNIX-based operating systems. On Windows, running test cases and applications manually means starting each MPJ Express process with the java command.
The MPJ Express infrastructure does not deal with security in the current release. The MPJ Express daemons could be a security concern, as these are Java applications listening on a port to execute user code. It is therefore recommended that the daemons run behind a suitably configured firewall that only accepts connections from trusted machines. In a normal scenario, these daemons would be running on the compute-nodes of a cluster, which are not accessible to the outside world. Alternatively, it is also possible to start MPJ Express processes 'manually', which avoids the runtime daemons. In addition, each MPJ Express process starts at least one server socket, and is thus assumed to be running on a machine with a configured firewall. Most MPI implementations assume firewalls as the protection mechanism from the outside world.
export MPJ_HOME=/home/aamir/mpj
export PATH=$PATH:$MPJ_HOME/bin
These lines may be added to ~/.bashrc
export MPJ_HOME="c:\\mpj"
export PATH=$PATH:"$MPJ_HOME\\bin"
These lines may be added to ~/.bashrc
cd mpj-user
mpjboot machines
cd mpj-user
mpjrun.sh -np 2 -jar $MPJ_HOME/lib/test.jar
mpjrun.bat -np 2 -jar %MPJ_HOME%/lib/test.jar
cd mpj-user
javac -cp .:$MPJ_HOME/lib/mpj.jar World.java
javac -cp .;%MPJ_HOME%/lib/mpj.jar World.java
mpjrun.sh -np 2 World
mpjrun.bat -np 2 World
mpjrun.sh -np 2 -jar hello.jar
mpjrun.bat -np 2 -jar hello.jar
cd mpj-user
import mpi.*;
public class World {
public World() {
}
public static void main(String args[]) throws Exception {
MPI.Init(args);
int me = MPI.COMM_WORLD.Rank();
int size = MPI.COMM_WORLD.Size();
System.out.println("Hi from <"+me+">");
MPI.Finalize();
}
}
javac -cp .:$MPJ_HOME/lib/mpj.jar World.java
javac -cp .;%MPJ_HOME%/lib/mpj.jar World.java
Manifest-Version: 1.0
Main-Class: World
Class-Path: mpj.jar
jar -cfm hello.jar manifest World.class
jar -tf hello.jar
One of the challenging aspects of a Java messaging system is creating a portable mechanism for bootstrapping MPJ Express processes across various platforms. If the compute-nodes are running a UNIX-based OS, it is possible to execute commands remotely using RSH/SSH, but if the compute-nodes are running Windows, these utilities are not available. The MPJ Express runtime provides a unified way of starting MPJ Express processes on compute-nodes irrespective of the operating system they use. The runtime system consists of two modules. The 'daemon' module runs on compute-nodes and listens for requests to start MPJ Express processes. The daemon is simply a Java application listening on an IP port, which starts a new JVM every time there is a request to run an MPJ Express process. The 'mpjrun' module acts as a client to the daemon module. This module is started on, for example, the cluster head-node; it contacts the daemons and returns standard output for the user to view.
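The daemon idea described above (listen on a port, start a new JVM per request) can be sketched roughly as follows. This is only an illustration and not the MPJ Express daemon source: the class name DaemonSketch, the one-line request format "mainClass rank", and the way the rank is passed to the child JVM are all assumptions made for this example. It also shows why the security note earlier matters, since anything that can connect to the port can start code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; this is not the MPJ Express daemon source.
final class DaemonSketch {

    // Command line for a fresh JVM running mainClass, with the rank as its only argument.
    static List<String> buildCommand(String classPath, String mainClass, int rank) {
        List<String> cmd = new ArrayList<>();
        cmd.add(Paths.get(System.getProperty("java.home"), "bin", "java").toString());
        cmd.add("-cp");
        cmd.add(classPath);
        cmd.add(mainClass);
        cmd.add(Integer.toString(rank));
        return cmd;
    }

    // Assumed request format: one text line, "<mainClass> <rank>", per connection.
    static void serve(int port, String classPath) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(new InputStreamReader(
                             client.getInputStream(), StandardCharsets.UTF_8))) {
                    String line = in.readLine();
                    String[] parts = (line == null) ? new String[0] : line.trim().split("\\s+");
                    if (parts.length != 2) {
                        continue; // ignore malformed requests
                    }
                    int rank = Integer.parseInt(parts[1]);
                    // Start a new JVM for this request; its output goes to the daemon's console.
                    new ProcessBuilder(buildCommand(classPath, parts[0], rank)).inheritIO().start();
                } catch (NumberFormatException e) {
                    System.err.println("Ignoring request with non-numeric rank");
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            serve(Integer.parseInt(args[0]), "."); // listen on the given port
        } else {
            System.out.println(buildCommand(".", "World", 0)); // just show the command
        }
    }
}
```

Run without arguments, the sketch only prints the command it would launch; run with a port number, it listens on that port.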
With Java, it is possible to run applications using class files, or class files bundled as a JAR file. The MPJ Express runtime allows the execution of MPJ Express applications both as JAR files and class files. Users may want to load MPJ Express JARs and classes either remotely or locally on the compute-nodes. With the remote loader, all classes (application and MPJ Express code) are loaded from the head-node. This is useful when there is no shared file system and the code is constantly being modified at the head-node. With the local loader, all classes (application and MPJ Express code) are loaded from the compute-node. This might be useful if there is a shared file system. As all classes are loaded locally, this might provide better performance than the remote loader. The default loader in the MPJ Express runtime infrastructure is the remote loader. The 'mpjrun' module provides the -jar switch to execute JAR files; no switch is required to execute class files. Users can select local loading with the -localloader switch. The -wdir switch can be used to run the code in the appropriate directory on the remote node. When running JAR files using -localloader, users should put the JAR in the CLASSPATH using the -cp switch.
MPJ Express uses the Java Service Wrapper Project software to install daemons as a native OS service. This means that some platform-specific code is needed. Currently, MPJ Express distributes only Linux- and Windows-specific native code. If you are interested in running MPJ Express daemons on other platforms such as AIX, FreeBSD, HP-UX, HP-UX64, IRIX, or MacOS, you can download the platform-specific code from the Java Service Wrapper Project. Some PATH variables in the scripts for these platforms will have to be changed. Feel free to contact us if you need any help with this. The rest of this section explains how to install, start, stop, and uninstall MPJ Express daemons on Linux and Windows. It also shows how to run your MPJ Express programs using the mpjrun module on these platforms.
cd mpj-user
machine1
machine2
Note that in the real-world, 'machine1' and 'machine2' would be
fully-qualified names, IP addresses, or aliases of your machines.
mpjrun.sh -np 2 World assumes that a
'machines' file is present in this directory. If your list
of machines is in a file named (let us say) 'mymachines.txt', or in
another directory, use the -machinesfile switch to point mpjrun
to it. To point mpjrun to mymachines.txt,
the exact command would be mpjrun.sh -machinesfile mymachines.txt
-np 2 World. This also applies to mpjrun.bat.
localhost
mpjboot machines
mpjdaemon start command to start
the daemon.
rc-update add mpjdaemon default
mpjhalt machines
mpjrun.sh -np 2 World
mpjrun.bat -np 2 World
mpjrun.sh -np 2 -jar hello.jar
mpjrun.bat -np 2 -jar hello.jar
mpjrun.sh -np 2 -Xms512M -Djava.library.path=/tmp World
mpjrun.bat -np 2 -Xms512M -Djava.library.path=c:/tmp World
cd mpj-user
# Number of processes
2
# Protocol switch limit
131072
# Entry in the form of machinename@port@rank
machine1@20000@0
machine2@20000@1
# Number of processes
2
# Protocol switch limit
131072
# Entry in the form of machinename@port@rank
localhost@20000@0
localhost@20002@1
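The configuration layout shown above (comment lines starting with #, then the number of processes, the protocol switch limit, and one machinename@port@rank entry per process) can be read with a few lines of Java. The following is a minimal sketch for illustration only, not the parser MPJ Express itself uses; the class name MpjConfSketch and the check that the number of entries equals the declared number of processes are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for the mpj.conf layout shown above; not MPJ Express code.
final class MpjConfSketch {

    static final class Entry {
        final String host;
        final int port;
        final int rank;

        Entry(String host, int port, int rank) {
            this.host = host;
            this.port = port;
            this.rank = rank;
        }
    }

    final int processCount;
    final int protocolSwitchLimit;
    final List<Entry> entries;

    private MpjConfSketch(int processCount, int protocolSwitchLimit, List<Entry> entries) {
        this.processCount = processCount;
        this.protocolSwitchLimit = protocolSwitchLimit;
        this.entries = entries;
    }

    static MpjConfSketch parse(List<String> lines) {
        // Drop blank lines and '#' comments, keep everything else in order.
        List<String> values = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (!line.isEmpty() && !line.startsWith("#")) {
                values.add(line);
            }
        }
        if (values.size() < 2) {
            throw new IllegalArgumentException("expected process count and protocol switch limit");
        }
        int count = Integer.parseInt(values.get(0));
        int limit = Integer.parseInt(values.get(1));
        List<Entry> entries = new ArrayList<>();
        for (String value : values.subList(2, values.size())) {
            String[] parts = value.split("@"); // machinename@port@rank
            if (parts.length != 3) {
                throw new IllegalArgumentException("bad entry: " + value);
            }
            entries.add(new Entry(parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2])));
        }
        // Assumption of this sketch: one entry per declared process.
        if (entries.size() != count) {
            throw new IllegalArgumentException("expected " + count + " entries, found " + entries.size());
        }
        return new MpjConfSketch(count, limit, entries);
    }

    public static void main(String[] args) {
        MpjConfSketch conf = parse(Arrays.asList(
                "# Number of processes", "2",
                "# Protocol switch limit", "131072",
                "# Entry in the form of machinename@port@rank",
                "localhost@20000@0", "localhost@20002@1"));
        for (Entry e : conf.entries) {
            System.out.println("rank " + e.rank + " -> " + e.host + ":" + e.port);
        }
    }
}
```

Note that the two localhost entries use different ports, since both processes run on the same machine.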
runmpj.sh mpj.conf World
java -cp .:$MPJ_HOME/lib/mpj.jar World <rank> mpj.conf niodev
java -cp .;%MPJ_HOME%/lib/mpj.jar World <rank> mpj.conf niodev
runmpj.sh mpj.conf hello.jar
java -jar hello.jar <rank> mpj.conf niodev
java -jar hello.jar <rank> mpj.conf niodev
MPJ Express contains a comprehensive test suite to test the functionality of almost every MPI function. This test suite consists mainly of mpiJava test cases, MPJ JGF benchmarks, and MPJ micro-benchmarks. The mpiJava test cases were originally developed by IBM and later translated to Java. As this software follows the API of mpiJava, these test cases can be used with little modification. The MPJ JGF benchmarks are developed and maintained by EPCC at the University of Edinburgh. MPJ Express redistributes these benchmarks as part of its test suite. The original copyrights and license remain intact, as can be seen in the source files of these benchmarks in $MPJ_HOME/test/jgf_mpj_benchmarks. Further details about these benchmarks can be seen here. MPJ Express also redistributes micro-benchmarks developed by Guillermo Taboada. Further details about these benchmarks can be obtained here.
The suite is located in the $MPJ_HOME/test directory. The test cases have been changed from their original versions in order to automate testing. TestSuite.java is the main class that calls each of the test cases present in this directory. The build.xml file in the test directory compiles all test cases and places test.jar into the lib directory. By default, the JGF MPJ benchmarks and MPJ micro-benchmarks are disabled. Edit $MPJ_HOME/test/TestSuite.java to uncomment these tests and execute them. Note that after changing TestSuite.java, you will have to recompile the test suite by executing 'ant' in the test directory.
cd mpj-user
mpjrun.sh -np 2 -jar $MPJ_HOME/lib/test.jar
mpjrun.bat -np 2 -jar %MPJ_HOME%/lib/test.jar
runmpj.sh mpj.conf $MPJ_HOME/lib/test.jar
java -jar $MPJ_HOME/lib/test.jar <rank> mpj.conf niodev
java -jar %MPJ_HOME%/lib/test.jar <rank> mpj.conf niodev
ant
cd test
ant
Java-docs can be seen in $MPJ_HOME/doc/javadocs
//rep.setThreshold((Level) Level.OFF ) ;
By calling the
setThreshold(..) method of the
LoggerRepository associated with rootLogger, the threshold level
for logging can be set. If this level is Level.OFF,
then all logging is dropped by the rootLogger. By default,
the level is set to Level.ALL.
cd $MPJ_HOME/bin
./mpjdaemon_linux_x86_32 console . Here we are starting
the daemon on a 32 bit x86 processor. Choose the appropriate script
for your machine.
cd %MPJ_HOME%/bin ;
wrapper.exe -c ../conf/wrapper.conf
mpjboot machines
Starting mpjd...
./mpjdaemon_linux_x86_32: line 1: ./daemon_linux_x86_32: cannot execute binary file
The reason is that by default x86 based code is called, which naturally does
not work on PPCs and UltraSPARCs.
We are currently in the process of writing smart scripts that call
the appropriate libraries based on the processor architecture and
operating system. In the meantime, this problem can be fixed in the
following way:
a. Edit $MPJ_HOME/bin/mpjboot and $MPJ_HOME/bin/mpjhalt
b. Comment the line ssh $host "cd $MPJ_HOME/bin;./mpjdaemon_linux_x86_32 start;"
c. Uncomment the line #ssh $host "cd $MPJ_HOME/bin;./mpjdaemon_solaris_sparc_64 start;"
d. cd $MPJ_HOME/lib
e. cp libwrapper.so_solaris_sparc_64 libwrapper.so

a. Edit $MPJ_HOME/bin/mpjboot and $MPJ_HOME/bin/mpjhalt
b. Comment the line ssh $host "cd $MPJ_HOME/bin;./mpjdaemon_linux_x86_32 start;"
c. Uncomment the line #ssh $host "cd $MPJ_HOME/bin;./mpjdaemon_linux_ppc_64 start;"
d. cd $MPJ_HOME/lib
e. cp libwrapper.so_linux_ppc_64 libwrapper.so

a. Edit $MPJ_HOME/bin/mpjboot and $MPJ_HOME/bin/mpjhalt
b. Comment the line ssh $host "cd $MPJ_HOME/bin;./mpjdaemon_linux_x86_32 start;"
c. Uncomment the line #ssh $host "cd $MPJ_HOME/bin;./mpjdaemon_macosx_ppc_32 start;"
d. cd $MPJ_HOME/lib
e. cp libwrapper.jnilib_macosx_ppc_32 libwrapper.jnilib
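The kind of selection that a 'smart script' would make can be sketched in Java by inspecting the os.name and os.arch system properties. This is only an illustration and not part of MPJ Express: the class name DaemonScriptSketch is hypothetical, and the property values matched here ("SunOS", "sparcv9", "ppc64", "ppc", "i386") are assumptions about what a given JVM reports, so check them on your own machines. The script names are the ones mentioned in the steps above.

```java
import java.util.Locale;

// Hypothetical helper; maps JVM platform properties to the daemon scripts named above.
final class DaemonScriptSketch {

    static String scriptFor(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        if (os.contains("linux") && (arch.equals("i386") || arch.equals("x86"))) {
            return "mpjdaemon_linux_x86_32";
        }
        if (os.contains("linux") && arch.equals("ppc64")) {
            return "mpjdaemon_linux_ppc_64";
        }
        if (os.contains("sunos") && arch.equals("sparcv9")) {
            return "mpjdaemon_solaris_sparc_64";
        }
        if (os.contains("mac") && arch.equals("ppc")) {
            return "mpjdaemon_macosx_ppc_32";
        }
        throw new IllegalArgumentException("no daemon script known for " + osName + "/" + osArch);
    }

    public static void main(String[] args) {
        try {
            System.out.println(scriptFor(System.getProperty("os.name"),
                                         System.getProperty("os.arch")));
        } catch (IllegalArgumentException e) {
            // Platforms outside the four listed above end up here.
            System.out.println(e.getMessage());
        }
    }
}
```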
export MX_HOME=/opt/mx
<target name="all" depends="compile,jars,java-docs,clean"
to:
<target name="all" depends="mxlib,compile,jars,java-docs,clean"
Note that we have added mxlib in the value of "depends" attribute.
-bash-3.00$ mx_info
MX Version: 1.1.7rc3cvs1_1_fixes
MX Build: @indus1:/opt/mx2g-1.1.7rc3 Thu May 31 11:03:00 PKT 2007
2 Myrinet boards installed.
The MX driver is configured to support up to 4 instances and 1024 nodes.
[ .. ]
                                          ROUTE COUNT
INDEX    MAC ADDRESS      HOST NAME       P0
-----    -----------      ---------       ---
   0) 00:60:dd:47:ad:7c   indus1:0        1,1
   1) 00:60:dd:47:ad:68   indus4:0        1,1
[ .. ]
This means we have two machines "indus1" and "indus4" connected to each other via Myrinet and they are using NIC-0. So now for this configuration, my machines file would look like this:
indus1
indus4
mpjrun.sh -np 2 -dev mxdev -Djava.library.path=$MPJ_HOME/lib HelloWorld
This command assumes that the Myrinet NICs with id 0 are used; this can be changed with the mpjrun switch "-mxboardnum".
There is a known, though only partially understood, problem on Windows and Solaris that results in hanging MPJ processes. It typically shows up when MPJ test-cases hang, neither completing nor throwing any error message.
We partially understand the problem, but if you encounter it, we would like to request some more debugging information. The required information can be obtained as follows. Edit $MPJ_HOME/src/xdev/niodev/NIODevice.java, go to line 3673, and uncomment the line "ioe1.printStackTrace() ;". Line 3673 applies to MPJ Express release 0.27 and might change in future releases. The general code snippet is like this:
catch (Exception ioe1) {
if(mpi.MPI.DEBUG && logger.isDebugEnabled() ) {
logger.debug(" error in selector thread " + ioe1.getMessage());
}
//ioe1.printStackTrace() ;
} //end catch(Exception e) ...
if(mpi.MPI.DEBUG && logger.isDebugEnabled()) {
logger.debug(" last statement in selector thread");
}
} //end run()
}; //end selectorThread which is an inner class
As a result, when the test-cases are executed again, users will see stack traces periodically. Most of these are related to socket closed exceptions, which are normal. If the code hangs now, the latest stack trace that is not about a socket being closed is probably the reason for the hang. We would request users to kindly email us the output so that we can fix the problem. A stack trace that leaves MPJ Express hanging on Solaris is as follows:
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:69)
at java.nio.channels.SelectionKey.isAcceptable(SelectionKey.java:342)
at xdev.niodev.NIODevice$2.run(NIODevice.java:3330)
at java.lang.Thread.run(Thread.java:595)
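The trace above shows SelectionKey.isAcceptable() throwing CancelledKeyException, which the java.nio API does when a key has been cancelled before its readiness is queried. A common defensive idiom in selector loops is to check key.isValid() first and to treat a cancelled key as 'not ready'. The sketch below illustrates that general idiom only; it is not the MPJ Express selector thread, it is not claimed to be the fix for the hang described here, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical illustration of guarding readiness checks on selection keys.
final class SelectorGuardSketch {

    // True only for keys that are still valid and ready to accept a connection.
    static boolean safeIsAcceptable(SelectionKey key) {
        try {
            return key.isValid() && key.isAcceptable();
        } catch (CancelledKeyException e) {
            // The key was cancelled by another thread between the two calls.
            return false;
        }
    }

    // Registers a channel, cancels its key, and queries it through the guard.
    static boolean queryCancelledKey() {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.configureBlocking(false);
            SelectionKey key = server.register(selector, SelectionKey.OP_ACCEPT);
            key.cancel();
            return safeIsAcceptable(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Calling key.isAcceptable() directly on the cancelled key would throw instead.
        System.out.println(queryCancelledKey()); // prints false
    }
}
```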
Some users have noticed that it takes a long time to bootstrap MPJ Express processes. For example,
user@machine:~/mpj-user> mpjrun.sh -np 6 -jar $MPJ_HOME/lib/test.jar
16:15:43.400 EVENT Starting Jetty/4.2.23
16:15:43.415 EVENT Started HttpContext[/]
16:15:43.419 EVENT Started SocketListener on 0.0.0.0:15000
16:15:43.419 EVENT Started org.mortbay.http.HttpServer@23ac23ac
16:15:43.420 EVENT Starting Jetty/4.2.23
16:15:43.420 EVENT Started HttpContext[/]
16:15:43.421 EVENT Started SocketListener on 0.0.0.0:15001
16:15:43.421 EVENT Started org.mortbay.http.HttpServer@50265026
[ pause for a minute or two ]
Starting process <0> on
Starting process <1> on
[ pause for a minute or two ]
Starting process <2> on
Starting process <3> on
[ pause for a minute or two ]
Starting process <4> on
Starting process <5> on
[ job starts ]
Thanks to Andy Botting, one of the users who identified this problem. It is perhaps related to name resolution, and we are currently working to fix it.
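Since name resolution is only suspected as the cause, one way to gather evidence is to time how long the JVM takes to resolve each name in your machines file. The following is a small diagnostic sketch, not part of MPJ Express; the class name ResolveTimerSketch is hypothetical, and host names are passed on the command line (it falls back to localhost if none are given).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic: measures how long the JVM takes to resolve a host name.
final class ResolveTimerSketch {

    // Milliseconds taken to resolve the host, or -1 if it cannot be resolved.
    static long resolveMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return -1;
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String[] hosts = args.length > 0 ? args : new String[] {"localhost"};
        for (String host : hosts) {
            long ms = resolveMillis(host);
            System.out.println(host + ": " + (ms < 0 ? "unresolvable" : ms + " ms"));
        }
    }
}
```

Resolution times of many seconds for the names in the machines file would be consistent with the pauses shown above, although they would not prove that this is the cause.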
Exception in thread "main" java.lang.RuntimeException: Another mpjrun
module is already running on this machine
at runtime.starter.MPJRun.<init>(MPJRun.java:135)
at runtime.starter.MPJRun.main(MPJRun.java:925)
This issue can be resolved by deleting the mpjdev.conf file. This file is
located in the directory that contains your main class or JAR file.
For example, if you are trying to run "-jar ../lib/test.jar",
the file would be in the ../lib directory.
chmod a+w $MPJ_HOME/logs
chmod a+x $MPJ_HOME/lib/*.dll
chmod a+w $MPJ_HOME/logs/wrapper.log