6) RUNNING NETWORK LINDA PROGRAMS
6.1) Why do I get the message:
"ntsnet: WARNING: ping may not be a valid Network Linda executable"
Ntsnet checks the local executable file for a magic string that should be
present in every Network Linda executable. If the string is missing, ntsnet
warns you but executes the file anyway. If the program then fails to run, the
file may really not be a Network Linda executable.
One common reason for this is that the executable is really a CDS executable,
in which case you should see messages like:
Linda initializing (2000 blocks).
Linda initialization complete.
Another reason for the warning is that you are wrapping your Network Linda
executable in a shell script, a trick that used to be necessary to run
different executables on different nodes. If that is the case, you can ignore
the message, as long as the shell script is written properly.
You can test the file yourself, using the command:
example% strings ping | grep linda_version
%__linda_version_tsnet_v2.5.2
Another quick way to find out if a file is a Network Linda executable is
to run it without ntsnet. You should see the messages:
ping: Network linda executable missing +LARGS argument, aborting.
ping: Use +LARGS and linda arguments if starting by hand,
ping: or start the executable using the ntsnet utility.
If you see
Linda initializing (2000 blocks).
Linda initialization complete.
then it's a CDS Linda executable, as mentioned above.
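The strings test above can be wrapped in a small helper when several files need
checking. This is only a sketch, not part of the Linda distribution: the
function name is_linda_exe is hypothetical, it looks for the same
"linda_version" text as the command above, and it assumes a grep that supports
the -a option (GNU and BSD grep do), so that strings is not required.

```shell
#!/bin/sh
# Hypothetical helper: report whether a file contains the "linda_version"
# marker text that the strings/grep test looks for.
is_linda_exe() {
    # -a makes grep treat a binary file as text; -q suppresses output so that
    # only the exit status is used.
    if grep -a -q 'linda_version' "$1"; then
        echo "$1: linda_version marker found"
    else
        echo "$1: no linda_version marker (shell script or non-Linda file?)"
        return 1
    fi
}
```

For example, "is_linda_exe ping" prints one line and returns a nonzero status
if the marker is absent.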
6.2) Why do I get the messages:
"Permission denied."
"ntsnet: too many workers exited to continue"
"ntsnet: needed: 1, started: 1, died: 1"
You are unable to rsh to a node in your nodelist. The rsh fails, and ntsnet
aborts because it cannot get enough workers to satisfy its requirements.
You may also see just the "Permission denied." message while the program runs
fine. That is because ntsnet was able to get enough workers, even though it
didn't get all that it started. By default, ntsnet only needs to get one
worker, although the -n option can change this.
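To find out which node is refusing rsh, it can help to try a trivial remote
command on each node by hand. The following is a minimal sketch under stated
assumptions: check_nodes and the RSH variable are hypothetical names, the node
names are passed as arguments, and the remote command "true" is assumed to
exist on every node.

```shell
#!/bin/sh
# Hypothetical helper: try a trivial remote command on each node and report
# which nodes refuse it. RSH may be set to another remote shell command
# (default: rsh).
check_nodes() {
    for node in "$@"; do
        # -n redirects stdin from /dev/null so the remote shell does not
        # consume our input.
        if ${RSH:-rsh} "$node" -n true >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}
```

For example, "check_nodes maple oak" prints one status line per node.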
6.3) Why do I get the messages:
"rsh: shell/tcp: unknown service"
"ntsnet: too many workers exited to continue"
"ntsnet: needed: 1, started: 1, died: 1"
This is a variant of the previous answer. In this case, rsh is failing to
execute getservbyname(3), perhaps due to an overloaded NIS server. The ntsnet
-delay option may help with this problem by decreasing the rate at which ntsnet
forks rsh processes, but you may just need to give ntsnet more nodes to choose
from. Ultimately, your system administrator may have to reconfigure your
system to eliminate this problem.
6.4) Why do I get the message:
"stty: TCGETS: Operation not supported on socket"
This is one version of a classic rsh problem. The problem is that the user's
.cshrc file has an stty command that fails when rsh is used (since rsh doesn't
use a pseudo terminal). By default, ntsnet starts up the workers on each remote
node with the rsh command. The standard solution is to use something like:
if ($?prompt == 0) then
exit
endif
This can go at the very beginning of .cshrc; in any case, it must come before
the stty commands. Other commands, such as biff, also work only in interactive
sessions and give different error messages.
6.5) Why do I get the message:
"Linda Error: node maple(15): keepalive failure"
The error message isn't as informative as it could be. What happened is that
node maple noticed that another node was not responding to what we call "keep
alive" messages. If a node isn't able to respond to keep alive messages, it is
probably in some bad state (perhaps due to NFS problems) that could cause the
whole Linda program to hang. So rather than have you run for another couple of
days before you get suspicious enough to abort the run, maple sounded the
alarm and exited, and ntsnet shut the program down.
In some cases, it might be useful to increase the keep alive period. This can
be done as in the example:
% ntsnet -kainterval 400 ping 100
This might be useful for long running jobs on some networks.
There is a discussion of keep alive messages on page 4-23 of the C-Linda
User's Guide.
6.6) Why do I get the messages:
"ntsnet: warning: rup rpc failed on oak: Program not registered"
"ntsnet: using fallback load: 0.990000"
By default, ntsnet uses a remote procedure call (rpc) to the rstatd daemon to
determine the load average of the remote machines. The rpc fails with this
message if rstatd is not running on one of the remote machines.
Many machines don't enable rstatd by default, and some machines don't support
it. It can usually be enabled by uncommenting the appropriate line in
/etc/inetd.conf and reinitializing inetd by sending it a SIGHUP. Ask your
system administrator if this can be done.
The message can be avoided by telling ntsnet not to get the load averages of
remote machines. This can be done by setting the getload resource to false in
the tsnet.config file, or, equivalently, by using the ntsnet +getload command
line option.
6.7) Why do I get the message:
"More evals than processors - deadlock could occur"
Ntsnet is used to start a fixed number of eval servers, processes that handle
eval requests. Each server handles only one request at a time, so a backlog of
eval requests can build up if more evals are executed than there are eval
servers. Deadlock can occur if the program is written to assume that all eval'd
processes are executing concurrently. If deadlock doesn't occur, every eval
request will eventually be serviced by an eval server that has finished
processing a previous eval request.
6.8) How do I run my Network Linda program on a heterogeneous network?
Ntsnet has many features that support executing Network Linda programs on
heterogeneous networks. The suffixstring resource (specified in the
configuration file) can be used to tell ntsnet to use a different executable
file on different machines. For example, with the configuration file
ntsnet.Appl.hp1.suffixstring: .hp
ntsnet.Appl.hp2.suffixstring: .hp
ntsnet.Appl.mysparc.suffixstring: .sparc
ntsnet.Appl.myrs6k.suffixstring: .rs6k
the command
% ntsnet ping 100
will cause ntsnet to use three different executables, ping.hp, ping.sparc, and
ping.rs6k, on the four different nodes.
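Before starting a run, it can be useful to confirm that every suffixed
executable actually exists. The sketch below mirrors the suffixstring settings
in the example configuration; exe_for_node and check_suffixed are hypothetical
helpers (ntsnet itself does the real name resolution), and the sketch assumes
all builds are visible in the current directory.

```shell
#!/bin/sh
# Hypothetical helper mirroring the suffixstring settings in the example:
# print the executable name expected for a given base name and node.
exe_for_node() {
    case "$2" in
        hp1|hp2) echo "$1.hp" ;;
        mysparc) echo "$1.sparc" ;;
        myrs6k)  echo "$1.rs6k" ;;
        *)       echo "$1" ;;      # no suffixstring assumed for other nodes
    esac
}

# Report any expected executable that is missing from the current directory.
check_suffixed() {
    base=$1
    shift
    for node in "$@"; do
        exe=$(exe_for_node "$base" "$node")
        [ -f "$exe" ] || echo "missing: $exe (node $node)"
    done
}
```

For example, "check_suffixed ping hp1 hp2 mysparc myrs6k" prints nothing when
ping.hp, ping.sparc, and ping.rs6k are all present.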
Also, map files can be used to equivalence a set of directories in
such a way that a different directory is used for each platform. For example,
with the map file
map /usr/bin/linda {
hp1 hp2 : /usr/bin/linda/hp;
mysparc : /usr/bin/linda/sparc;
myrs6k : /usr/bin/linda/rs6k;
}
the command
% ntsnet /usr/bin/linda/hp/ping 100
executed on hp1 will cause ntsnet to use three different executables,
/usr/bin/linda/hp/ping, /usr/bin/linda/sparc/ping, and
/usr/bin/linda/rs6k/ping, on the four different nodes.
6.9) Can I execute Network Linda programs without using the rshd daemon?
Yes, ntsnet uses the linda_rsh shell script to insulate it from the actual
command used for remote execution. Linda_rsh can be modified by the user, but
the supplied version supports both "rsh" and "on". Ntsnet passes the value of
the lindarsharg resource to linda_rsh for each remote process, so either "rsh"
or "on" can be used. For example, with the configuration file
ntsnet.Node.lindarsharg: on
ntsnet.mydec.lindarsharg: rsh
ntsnet will cause linda_rsh to use "on" for all nodes except "mydec", since
DECstations don't support "on".
If linda_rsh is modified by the user to support another remote execution
command, the lindarsharg resource can still be used to let linda_rsh choose
what command to use. Ntsnet just passes the appropriate value of the
lindarsharg resource for each node.
6.10) Why do I get the message:
"ntsnet: shutting down with return code 9"
when my Network Linda program finishes?
Your real_main function explicitly or implicitly returned a value other than
zero. The real_main function is defined to return an integer value. If
real_main has no return statement, an undefined value is returned to ntsnet,
and ntsnet reports that value.
6.11) How can I tell ntsnet exactly how many processes to schedule on each
node?
Try putting something like the following in your tsnet.config file:
! These settings are for "manual mode" scheduling.
! The speedfactor and minworkers values are necessary default values.
ntsnet.Appl.getload: False
ntsnet.Appl.maxprocspernode: 1000000
ntsnet.Node.speedfactor: 1.0
ntsnet.Appl.maxworkers: 1000000
ntsnet.Appl.minworkers: 1
! These settings reflect the desire to not count the master
! process, and to run one worker per node.
ntsnet.Appl.masterload: 0.0
ntsnet.Node.threshold: 1.0
You can now use the threshold resource to control how many workers are
scheduled on a given node. For example, if you want to schedule three processes
on node "frank", just add the line:
ntsnet.frank.threshold: 3.0
If you decide that you want to include the master process in the count,
just remove the line that sets masterload to 0.0 (it defaults to
1.0).
Note that by setting maxworkers to a million, ntsnet schedules as
many processes as it can on each node. The threshold acts as the limit. With
minworkers set to one, ntsnet doesn't consider it an error to only schedule, say
ten processes, rather than one million.
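When many nodes need individual limits, the threshold lines can be generated
rather than typed. This is a small sketch, not an ntsnet feature:
gen_thresholds is a hypothetical helper that takes "node=count" arguments and
prints lines in the same resource format as the "frank" example above.

```shell
#!/bin/sh
# Hypothetical helper: print one threshold line per "node=count" argument,
# in the resource format used in the configuration file.
gen_thresholds() {
    for pair in "$@"; do
        node=${pair%%=*}    # text before the first '='
        count=${pair#*=}    # text after the first '='
        echo "ntsnet.$node.threshold: $count.0"
    done
}
```

For example, "gen_thresholds frank=3 joe=2" prints a threshold line for frank
and one for joe, which can be appended to the configuration file.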
6.12) How can I tell ntsnet to run one eval server on every node, including
the local node (so that the local node executes both the real_main() process
and a worker process)?
It can be done using the same basic technique described in the previous
answer. Try putting something like the following in your tsnet.config file:
ntsnet.Appl.getload: False
ntsnet.Appl.maxprocspernode: 2
ntsnet.Node.speedfactor: 1.0
ntsnet.Appl.maxworkers: 1000000
ntsnet.Appl.minworkers: 1
ntsnet.Appl.masterload: 0.0
ntsnet.Node.threshold: 1.0
6.13) Why do I get the message:
"ntsnet: WARNING: ping appears to be incompatible with ntsnet"
This means that your Network Linda program was built with a different version
of Network Linda than the one that you're using to execute it. This is only a
warning, but if your program doesn't work properly, this is probably the
reason.
6.14) Why does my program take so long to start executing?
This is usually because rsh/rshd is taking a long time. Rshd is usually slow
because it reads and executes the user's .cshrc file on the remote machine. It
is generally a good idea to modify the .cshrc file to not do very much when
invoked via rsh. This is described in Subject 6.4. Basically, put something
like
if ($?prompt == 0) then
exit
endif
near the beginning of .cshrc (probably after setting path). This can sometimes
avoid extra work and speed execution.
However, the real problem may be that it is taking a very long time for rshd
to even start reading .cshrc, due to the way your home directory is
configured. For example, if your home directory is auto mounted by remote
machines on your network, when you start executing your Network Linda program,
all the remote machines will have to mount the exported partitions of your
local machine.
One solution is to hard mount, rather than auto mount, your home directory on
the remote machines. Another solution is to make the home directory local on
each of the remote machines. Your home directory can still contain symbolic
links to common directories, but if .cshrc is local to each remote machine,
start up time for your Network Linda program can be much faster.
It is a good idea to test the speed of rsh with a simple example,
such as
example% rsh remotenode date
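The same test can be timed across several nodes to spot the slow ones. The
following is a rough sketch under stated assumptions: time_rsh and the RSH
variable are hypothetical names, and "date +%s" (seconds since the epoch) is
assumed to be supported by the local date command, which gives only
one-second resolution.

```shell
#!/bin/sh
# Hypothetical helper: report roughly how many seconds a trivial remote
# command takes on each node. RSH may be set to another remote shell
# command (default: rsh).
time_rsh() {
    for node in "$@"; do
        start=$(date +%s)
        ${RSH:-rsh} "$node" -n date >/dev/null 2>&1
        end=$(date +%s)
        echo "$node: $((end - start))s"
    done
}
```

For example, "time_rsh maple oak" prints one line per node; a node that takes
many seconds is a candidate for the .cshrc and home directory fixes described
here.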
Once rsh runs faster, there may still be a problem due to the Network Linda
executable not being local to the remote machines. The best thing is to
distribute the executable to a local directory on all the remote machines
once, and then use ntsnet to execute it from now on. This makes sense for
production use, in particular.
There are many ways to distribute the executable. You can execute rcp
directly, or use ftp. You can also use the ntsnet -distribute option with
+cleanup to distribute the executable, but not delete it afterwards.
This distribution scheme assumes that the executable is in a
directory that has the same path on all machines, but is local on all machines.
A map file can be used to make different path names equivalent. You could try
experiments with /tmp (which is usually local to each machine) to see if this is
a reason for slow start up.
6.15) How do I set environment variables for remote Linda processes?
There are various ways that this can be done, but the different methods can
depend on the remote execution mechanism that you're using. For instance, if
you're using "on" (by setting the lindarsharg resource in your configuration
file), then the local environment is automatically exported to remote
processes.
Rsh doesn't do that, so another mechanism is necessary. If csh is the default
shell on a remote node, you can set environment variables in your remote
.cshrc files, which are sourced by csh before executing the Linda program. If
you're using sh or ksh, this isn't possible, since .profile isn't sourced for
non-interactive execution. In that case, you may need to create a modified
version of linda_rsh to achieve the desired effect.
For example, suppose you want to set an environment variable such as DISPLAY
to different values on different remote nodes. One way to do this is to add a
few lines to linda_rsh, as follows:
*) case "$rsh_arg" in
on) exec /usr/bin/on -n $host "$@"
;;
+ DISPLAY=*) exec /usr/ucb/rsh $host $user -n $rsh_arg "$@"
+ ;;
*) exec /usr/ucb/rsh $host $user -n "$@"
;;
esac
;;
The lines prefixed by "+" are the new lines. Note that the path for rsh and on
varies on different platforms, so don't use this verbatim.
Now add the following lines to ~/.tsnet.config:
ntsnet.frank.lindarsharg: DISPLAY=biff:0
ntsnet.joe.lindarsharg: DISPLAY=chet:0
ntsnet.Node.lindarsharg: DISPLAY=junk:0
Now, when ntsnet executes linda_rsh to start a process on node frank, it will
include the option "-r DISPLAY=biff:0", and on node joe, it will include the
option "-r DISPLAY=chet:0". In both cases, the new line in linda_rsh will be
used to execute the remote process, and DISPLAY will be set as specified in the
remote processes.
Another method is to wrap the Linda program in a shell script that determines
what node it is running on, sets any environment variables, and then execs the
Linda program. The shell script could look something like:
#!/bin/sh
case `hostname` in
frank*) DISPLAY=biff:0 ;;
joe*) DISPLAY=chet:0 ;;
*) DISPLAY=junk:0 ;;
esac
export DISPLAY
exec /usr/linda/bin/foo "$@"
Note that using this method, you will see warning messages from ntsnet that
you may not be executing a valid Network Linda program. This is correct, since
ntsnet only sees the shell script, which isn't a Network Linda program. You can
safely ignore the warnings.
We suggest that you use the ntsnet -vv option when debugging these kinds of
changes.
See sections 6.4, 6.9, and 6.14 for more information on the topics of rsh and
lindarsharg.
6.16) Why do I get the message:
% ntsnet -n 3 suite
Internal Error... Opening passwd file in parse passwd file
Contact Customer Service at Scientific Computing Associates
TEL 203-777-7442
FAX 203-776-4074
EMAIL lsupport@LindaSpaces.com
ntsnet: master process exited with return value 1
when trying to run my Linda program?
Check the permissions on the Linda license file <top>/lib/linda.lcn to be sure
that it is readable by you.
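The permission check can be done with the shell's own file test. This is a
minimal sketch: check_readable is a hypothetical helper, and the full path of
the license file in your installation must be supplied as its argument.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can read a file.
check_readable() {
    if [ -r "$1" ]; then
        echo "$1: readable"
    else
        echo "$1: not readable (check that it exists and its permissions)"
        return 1
    fi
}
```

If the file is reported as not readable, "ls -l" on the same path shows its
owner and permission bits.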
6.17) Why do I get the message:
% ntsnet -n 2 -suffix ping
Linda Error: node sol1 (-1): hostname not found in configuration file
ntsnet: worker on node sol1.sca.com exited abnormally
cl-examples
ping.cl ping.sun ping.sol
% more ~/.tsnet.config
Tsnet.Appl.nodelist: sun1 sun2 sol1
Tsnet.Appl.Node.suffixstring: .sun
Tsnet.Appl.sol1.suffixstring: .sol
when trying to run my Linda program heterogeneously?
This can occur if your Network Linda executables were built with different
versions of the Linda compiler. To run heterogeneously, all Linda executables
need to be compiled with the same version of the clc compiler.