Scientific Computing Associates

6) RUNNING NETWORK LINDA PROGRAMS

6.1) Why do I get the message:

  "ntsnet: WARNING: ping may not be a valid Network Linda executable"

Ntsnet checks the local executable file for a magic string that should be present in every Network Linda executable. If the string is missing, ntsnet issues this warning but executes the file anyway. If the program then fails to run, the file may really not be a Network Linda executable.

One common reason for this is that the executable is really a CDS executable, in which case, you should see a bunch of messages like:

  Linda initializing (2000 blocks).
  Linda initialization complete.

Another reason for the warning is that you are wrapping your Network Linda executable in a shell script, a trick that used to be necessary to run different executables on different nodes. In that case, you can ignore the message, as long as the shell script is written properly.

You can try testing the file yourself, using the command:

  example% strings ping | grep linda_version
  %__linda_version_tsnet_v2.5.2

Another quick way to find out if a file is a Network Linda executable is to run it without ntsnet. You should see the messages:

  ping: Network linda executable missing +LARGS argument, aborting.
  ping: Use +LARGS and linda arguments if starting by hand,
  ping: or start the executable using the ntsnet utility.

If you see

  Linda initializing (2000 blocks).
  Linda initialization complete.

then it's a CDS Linda executable, as mentioned above.
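
The magic-string test above can be wrapped in a small shell function. This is only a sketch: the function name is made up for illustration, and it searches the file with grep directly instead of piping through strings, which assumes your grep can search binary files (GNU grep can).

```shell
#!/bin/sh
# is_network_linda FILE
# Hypothetical helper: reports whether FILE contains the "linda_version"
# string that the strings/grep test in this answer looks for.
is_network_linda() {
    # grep -q prints nothing; its exit status says whether the string was found.
    if grep -q 'linda_version' "$1" 2>/dev/null; then
        echo "$1: contains linda_version string"
        return 0
    else
        echo "$1: no linda_version string found"
        return 1
    fi
}
```

For example, "is_network_linda ping" should succeed for a real Network Linda executable and fail for a shell-script wrapper around one.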

6.2) Why do I get the messages:

  "Permission denied."
  "ntsnet: too many workers exited to continue"
  "ntsnet: needed: 1, started: 1, died: 1"

You are unable to rsh to a node in your nodelist. The rsh fails, and ntsnet aborts because it cannot get enough workers to satisfy its requirements.

You may also just see the "Permission denied." message, and the program runs fine. That is because ntsnet was able to get enough workers, even though it didn't get all that it started. By default, ntsnet only needs to get one worker, although the -n option can change this.
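
Before running ntsnet, it can help to find out which nodes refuse rsh. The following is a minimal sketch, not part of Network Linda: the function name and the RSH variable are assumptions made for this example, and the node names are passed in by hand rather than read from your nodelist.

```shell
#!/bin/sh
# Remote shell command to test; override RSH to try a different command.
RSH=${RSH:-rsh}

# check_nodes NODE...
# Runs the trivial command "true" on each node and reports the result.
check_nodes() {
    failed=0
    for node in "$@"; do
        # stdin comes from /dev/null so rsh never waits for terminal input.
        if $RSH "$node" true </dev/null >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    # Non-zero status if any node could not run the remote command.
    return $failed
}
```

A call such as "check_nodes maple oak" prints one line per node. Any node reported as FAILED would probably also produce "Permission denied." under ntsnet.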

6.3) Why do I get the messages:

  "rsh: shell/tcp: unknown service"
  "ntsnet: too many workers exited to continue"
  "ntsnet: needed: 1, started: 1, died: 1"

This is a variant of the previous answer. In this case, rsh is failing to execute getservbyname(3), perhaps due to an overloaded NIS server. The ntsnet -delay option may help by decreasing the rate at which ntsnet forks rsh processes, but you may just need to give ntsnet more nodes to choose from. Ultimately, your system administrator may have to reconfigure your system to eliminate this problem.

6.4) Why do I get the message:

  "stty: TCGETS: Operation not supported on socket"

This is one version of a classic rsh problem. The problem is that the user's .cshrc file has an stty command that fails when rsh is used (since rsh doesn't use a pseudo terminal). Ntsnet starts up the workers on each remote node with the rsh command, by default. The standard solution is to use something like:

  if ($?prompt == 0) then
    exit
  endif

This could be put right at the beginning of .cshrc, but must be put before the stty commands. Other commands, such as biff, only work on interactive runs, giving different error messages.
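
One way to confirm that the startup files are now quiet is to run a remote command that prints nothing and see whether any output comes back. This is a sketch under assumptions: the function name and the RSH variable are invented for this example.

```shell
#!/bin/sh
# Remote shell command to test; override RSH to try a different command.
RSH=${RSH:-rsh}

# check_quiet_login NODE
# A remote "true" should print nothing at all, so anything captured here
# came from the shell startup files (or from rsh itself).
check_quiet_login() {
    node=$1
    noise=$($RSH "$node" true 2>&1 </dev/null)
    if [ -n "$noise" ]; then
        echo "$node: unexpected output from remote login:"
        echo "$noise"
        return 1
    fi
    echo "$node: quiet"
}
```

If "check_quiet_login maple" still shows an stty or biff message, the guard is probably placed too late in .cshrc on that node.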

6.5) Why do I get the message:

  "Linda Error: node maple(15): keepalive failure"

The error message isn't as informative as it could be. What happened is that node maple noticed that another node was not responding to what we call "keep alive" messages. If a node isn't able to respond to keep alive messages, it is probably in some bad state (perhaps due to NFS problems?) that could cause the whole Linda program to hang. So rather than have you run for another couple of days before you get suspicious enough to abort the run, maple sounded the alarm, exited, and ntsnet shut the program down.

In some cases, it might be useful to increase the keep alive period. This can be done as in the example:

  % ntsnet -kainterval 400 ping 100

This might be useful for long running jobs on some networks.

There is a discussion of keep alive messages on page 4-23 of the
C-Linda User's Guide.

6.6) Why do I get the messages:

  "ntsnet: warning: rup rpc failed on oak: Program not registered"
  "ntsnet: using fallback load: 0.990000"

By default, ntsnet uses a remote procedure call (rpc) to the rstatd daemon to determine the load average of the remote machines. The rpc fails with this message if rstatd is not running on one of the remote machines.

Many machines don't enable rstatd by default, and some machines don't support it. It can usually be enabled by uncommenting the appropriate line in /etc/inetd.conf and reinitializing inetd by sending it a SIGHUP. Ask your system administrator if this can be done.

The message can be avoided by telling ntsnet not to get the load averages of remote machines. This can be done by setting the getload resource to false in the tsnet.config file, or, equivalently, by using the ntsnet +getload command line option.
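
As a sketch of the configuration-file approach, the function below appends the getload setting to a configuration file unless the file already sets it. The function name is hypothetical, and the resource line uses the ntsnet.Appl.getload form shown in section 6.11. Check which configuration file your installation actually reads before using it.

```shell
#!/bin/sh
# disable_getload CONFIG_FILE
# Hypothetical helper: adds "ntsnet.Appl.getload: False" to CONFIG_FILE
# unless a getload line in that form is already present.
disable_getload() {
    config=$1
    if grep -q '^ntsnet\.Appl\.getload:' "$config" 2>/dev/null; then
        echo "$config: getload already set, leaving it alone"
    else
        printf '%s\n' 'ntsnet.Appl.getload: False' >> "$config"
        echo "$config: added ntsnet.Appl.getload: False"
    fi
}
```

Calling it twice on the same file adds the line only once.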

6.7) Why do I get the message:

  "More evals than processors - deadlock could occur"

Ntsnet starts a fixed number of eval servers, which are processes that handle eval requests. Each server handles only one request at a time, so a backlog of eval requests builds up if more evals are executed than there are eval servers. Deadlock can occur if the program is written to assume that all eval'd processes are executing concurrently. If deadlock doesn't occur, every eval request will eventually be serviced by an eval server that has finished processing a previous request.

6.8) How do I run my Network Linda program on a heterogeneous network?

Ntsnet has many features that support executing Network Linda programs on heterogeneous networks. The suffixstring resource (specified in the configuration file) can be used to tell ntsnet to use a different executable file on different machines. For example, with the configuration file

  ntsnet.Appl.hp1.suffixstring: .hp
  ntsnet.Appl.hp2.suffixstring: .hp
  ntsnet.Appl.mysparc.suffixstring: .sparc
  ntsnet.Appl.myrs6k.suffixstring: .rs6k

the command

  % ntsnet ping 100

will cause ntsnet to use three different executables, ping.hp, ping.sparc, and ping.rs6k on the four different nodes.
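
Before starting such a run, it may be worth confirming that every suffixed executable actually exists. The sketch below assumes the per-platform files are all visible from the directory where you run the check. The function name is invented for this example.

```shell
#!/bin/sh
# check_suffixed BASE SUFFIX...
# Reports whether BASE followed by each SUFFIX names an executable file.
check_suffixed() {
    base=$1
    shift
    missing=0
    for suffix in "$@"; do
        # -x tests only the permission bits, so it also works for
        # binaries built for a different platform.
        if [ -x "$base$suffix" ]; then
            echo "$base$suffix: ok"
        else
            echo "$base$suffix: missing or not executable"
            missing=1
        fi
    done
    return $missing
}
```

For the configuration above, the call would be "check_suffixed ping .hp .sparc .rs6k".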

Also, map files can be used to equivalence a set of directories in such a way that a different directory is used for each platform. For example, with the map file

  map /usr/bin/linda {
    hp1 hp2 : /usr/bin/linda/hp;
    mysparc : /usr/bin/linda/sparc;
    myrs6k : /usr/bin/linda/rs6k;
  }

the command

  % ntsnet /usr/bin/linda/hp/ping 100

executed on hp1 will cause ntsnet to use three different executables,
/usr/bin/linda/hp/ping, /usr/bin/linda/sparc/ping, and
/usr/bin/linda/rs6k/ping on the four different nodes.
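
The effect of this map file can be illustrated with a short shell sketch. It is not how ntsnet implements map files; it only mimics the path substitution for the four nodes in the example. Both function names are made up, and nodes not listed in the map are simply rejected.

```shell
#!/bin/sh
# node_dir NODE
# Directory that the example map file assigns to NODE.
node_dir() {
    case $1 in
        hp1|hp2) echo /usr/bin/linda/hp ;;
        mysparc) echo /usr/bin/linda/sparc ;;
        myrs6k)  echo /usr/bin/linda/rs6k ;;
        *)       return 1 ;;   # node not covered by this sketch
    esac
}

# translate FROM_NODE TO_NODE PATH
# Rewrites PATH, as seen on FROM_NODE, into the equivalent path on TO_NODE.
translate() {
    from_dir=$(node_dir "$1") || return 1
    to_dir=$(node_dir "$2") || return 1
    case $3 in
        # Path lies under the mapped directory: swap the directory prefix.
        "$from_dir"/*) echo "$to_dir/${3#"$from_dir"/}" ;;
        # Anything else is left unchanged.
        *)             echo "$3" ;;
    esac
}
```

For instance, "translate hp1 mysparc /usr/bin/linda/hp/ping" prints /usr/bin/linda/sparc/ping, matching the executable named above for node mysparc.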

6.9) Can I execute Network Linda programs without using the rshd daemon?

Yes, ntsnet uses the linda_rsh shell script to insulate it from the actual command used for remote execution. Linda_rsh can be modified by the user, but the supplied version supports both "rsh" and "on". Ntsnet passes the value of the lindarsharg resource to linda_rsh for each remote process, so either "rsh" or "on" can be used. For example, with the configuration file

  ntsnet.Node.lindarsharg: on
  ntsnet.mydec.lindarsharg: rsh

ntsnet will cause linda_rsh to use "on" for all nodes except "mydec", since DECstations don't support "on".

If linda_rsh is modified by the user to support another remote execution command, the lindarsharg resource can still be used to let linda_rsh choose what command to use. Ntsnet just passes the appropriate value of the lindarsharg resource for each node.

6.10) Why do I get the message:

  "ntsnet: shutting down with return code 9"

when my Network Linda program finishes?

Your real_main function returned a value other than zero, either explicitly or implicitly. The real_main function is defined to return an integer value. If real_main has no return statement, an undefined value is returned to ntsnet, which then reports it.

6.11) How can I tell ntsnet exactly how many processes to schedule on each node?

Try putting something like the following in your tsnet.config file:

  ! These settings are for "manual mode" scheduling.
  ! The speedfactor and minworkers values are necessary default values.
  ntsnet.Appl.getload: False
  ntsnet.Appl.maxprocspernode: 1000000
  ntsnet.Node.speedfactor: 1.0
  ntsnet.Appl.maxworkers: 1000000
  ntsnet.Appl.minworkers: 1

  ! These settings reflect the desire to not count the master
  ! process, and to run one worker per node.
  ntsnet.Appl.masterload: 0.0
  ntsnet.Node.threshold: 1.0

You can now use the threshold resource to control how many workers are scheduled on a given node. For example, if you want to schedule three processes on node "frank", just add the line:

  ntsnet.frank.threshold: 3.0

If you decide that you want to include the master process in the count, just remove the line that sets masterload to 0.0 (it defaults to 1.0).

Note that by setting maxworkers to a million, ntsnet schedules as many processes as it can on each node. The threshold acts as the limit. With minworkers set to one, ntsnet doesn't consider it an error to only schedule, say ten processes, rather than one million.
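
If you have many nodes, the per-node threshold lines can be generated instead of typed. The helper below is a hypothetical sketch: it takes NODE=COUNT pairs with whole-number counts and prints lines in the same form as the "frank" example.

```shell
#!/bin/sh
# gen_thresholds NODE=COUNT...
# Prints one threshold resource line per pair; COUNT is assumed to be
# a whole number of processes.
gen_thresholds() {
    for pair in "$@"; do
        node=${pair%%=*}    # text before the first "="
        count=${pair#*=}    # text after the first "="
        printf 'ntsnet.%s.threshold: %s.0\n' "$node" "$count"
    done
}
```

Running "gen_thresholds frank=3 joe=2" prints two lines that can be appended to the configuration file after the general settings.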

6.12) How can I tell ntsnet to run one eval server on every node, including the local node (so that the local node executes both the real_main() process and a worker process)?

It can be done using the same basic technique described in the previous answer. Try putting something like the following in your tsnet.config file:

  ntsnet.Appl.getload: False
  ntsnet.Appl.maxprocspernode: 2
  ntsnet.Node.speedfactor: 1.0
  ntsnet.Appl.maxworkers: 1000000
  ntsnet.Appl.minworkers: 1
  ntsnet.Appl.masterload: 0.0
  ntsnet.Node.threshold: 1.0

6.13) Why do I get the message:

  "ntsnet: WARNING: ping appears to be incompatible with ntsnet"

This means that your Network Linda program was built with a different version of Network Linda than the one that you're using to execute it. This is only a warning, but if your program doesn't work properly, this is probably the reason.

6.14) Why does my program take so long to start executing?

This is usually because rsh/rshd is taking a long time. Rshd is usually slow because it reads and executes the user's .cshrc file on the remote machine. It is generally a good idea to modify the .cshrc file to not do very much when invoked via rsh. This is described in Subject 6.4. Basically, put something like

  if ($?prompt == 0) then
    exit
  endif

near the beginning of .cshrc (probably after setting path). This can sometimes avoid extra work and speed execution.

However, the real problem may be that it is taking a very long time for rshd to even start reading .cshrc, due to the way your home directory is configured. For example, if your home directory is auto mounted by remote machines on your network, when you start executing your Network Linda program, all the remote machines will have to mount the exported partitions of your local machine.

One solution is to hard mount, rather than auto mount, your home directory on the remote machines. Another solution is to make the home directory local on each of the remote machines. Your home
directory can still contain symbolic links to common directories, but if .cshrc is local to each remote machine, start up time for your Network Linda program can be much faster.

It is a good idea to test the speed of rsh with a simple example, such as

  example% rsh remotenode date

Once rsh runs faster, there may still be a problem due to the Network Linda executable not being local to the remote machines. The best thing is to distribute the executable to a local directory on all the remote machines once, and then use ntsnet to execute it from now on. This makes sense for production use, in particular.

There are many ways to distribute the executable. You can execute rcp directly, or use ftp. You can also use the ntsnet -distribute option with +cleanup to distribute the executable, but not delete it afterwards.

This distribution scheme assumes that the executable is in a directory that has the same path on all machines, but is local on all machines. A map file can be used to make different path names equivalent. You could try experiments with /tmp (which is usually local to each machine) to see if this is a reason for slow start up.
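
Distributing the executable by hand with rcp could look like the following sketch. The function name, the RCP variable, and the idea of passing the node names as arguments are assumptions made for this example. The destination must be a path that is local on every node, such as a file under /tmp.

```shell
#!/bin/sh
# Remote copy command; override RCP to use a different one.
RCP=${RCP:-rcp}

# distribute EXECUTABLE DEST_PATH NODE...
# Copies EXECUTABLE to DEST_PATH on every listed node.
distribute() {
    exe=$1
    dest=$2
    shift 2
    rc=0
    for node in "$@"; do
        if $RCP "$exe" "$node:$dest"; then
            echo "$node: copied $exe to $dest"
        else
            echo "$node: copy failed"
            rc=1
        fi
    done
    # Non-zero status if any copy failed.
    return $rc
}
```

For a start-up experiment, "distribute ping /tmp/ping maple oak" places a copy in /tmp on each node, after which the run can be started from the /tmp path.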

6.15) How do I set environment variables for remote Linda processes?

There are various ways that this can be done, but the right method depends on the remote execution mechanism that you're using. For instance, if you're using "on" (by setting the lindarsharg resource in your configuration file), then the local environment is automatically exported to remote processes. Rsh doesn't do that, so another mechanism is necessary. If csh is the default shell on a remote node, you can set environment variables in your remote .cshrc files, which are sourced by csh before executing the Linda program. If you're using sh or ksh, this isn't possible, since .profile isn't sourced for non-interactive execution. In that case, you may need to create a modified version of linda_rsh to achieve the desired effect.

For example, suppose you want to set the environment variable DISPLAY to different values on different remote nodes. One way to do this is to add a few lines to linda_rsh, as follows:

    *) case "$rsh_arg" in
         on) exec /usr/bin/on -n $host "$@"
           ;;
  +      DISPLAY=*) exec /usr/ucb/rsh $host $user -n $rsh_arg "$@"
  +        ;;
         *) exec /usr/ucb/rsh $host $user -n "$@"
           ;;
       esac
       ;;

The lines prefixed by "+" are the new lines. Note that the path for rsh and on varies on different platforms, so don't use this verbatim.

Now add the following lines to ~/.tsnet.config:

  ntsnet.frank.lindarsharg: DISPLAY=biff:0
  ntsnet.joe.lindarsharg: DISPLAY=chet:0
  ntsnet.Node.lindarsharg: DISPLAY=junk:0

Now, when ntsnet executes linda_rsh to start a process on node frank, it will include the option "-r DISPLAY=biff:0", and on node joe, it will include the option "-r DISPLAY=chet:0". In both cases, the new line in linda_rsh will be used to execute the remote process, and DISPLAY will be set as specified in the remote processes.

Another method is to wrap the Linda program in a shell script that determines what node it is running on, sets any environment variables, and then execs the Linda program. The shell script could look something like:

  #!/bin/sh
  case `hostname` in
    frank*) DISPLAY=biff:0 ;;
    joe*) DISPLAY=chet:0 ;;
    *) DISPLAY=junk:0 ;;
  esac
  export DISPLAY
  exec /usr/linda/bin/foo "$@"

Note that using this method, you will see warning messages from ntsnet that you may not be executing a valid Network Linda program. This is correct, since ntsnet only sees the shell script, which isn't a Network Linda program. You can safely ignore the warnings.

We suggest that you use the ntsnet -vv option when debugging these kinds of changes.

See sections 6.4, 6.9, and 6.14 for more information on the topics of rsh and lindarsharg.

6.16) Why do I get the message:

  % ntsnet -n 3 suite
  Internal Error... Opening passwd file in parse passwd file
  Contact Customer Service at Scientific Computing Associates
  TEL 203-777-7442
  FAX 203-776-4074
  EMAIL [email protected]
  ntsnet: master process exited with return value 1

when trying to run my Linda program?

Check the permissions on the Linda license file <top>/lib/linda.lcn to be sure that it is readable by you.

6.17) Why do I get the message:

  % ntsnet -n 2 -suffix ping
  Linda Error: node sol1 (-1): hostname not found in configuration file
  ntsnet: worker on node sol1.sca.com exited abnormally

  % ls cl-examples
  ping.cl ping.sun ping.sol

  % more ~/.tsnet.config
  Tsnet.Appl.nodelist: sun1 sun2 sol1
  Tsnet.Appl.Node.suffixstring: .sun
  Tsnet.Appl.sol1.suffixstring: .sol

when trying to run my Linda program heterogeneously?

This can occur if your Network Linda executables were built with different versions of the Linda compiler. To run heterogeneously, all Linda executables need to be compiled with the same version of the clc compiler.
