This is an outdated version of the HTCondor Manual. You can find current documentation at http://htcondor.org/manual.

7.4 Condor on Windows

Will Condor work on a network of mixed Unix and Windows machines?

You can have a Condor pool that consists of both Unix and Windows machines.

Your central manager can be either Windows or Unix. For example, even if you had a pool consisting strictly of Unix machines, you could use a Windows box for your central manager, and vice versa.

Submitted jobs can originate from either a Windows or a Unix machine, and be destined to run on Windows or a Unix machine. Note that there are still restrictions on the supported universes for jobs executed on Windows machines.

So, in summary:

  1. A single Condor pool can consist of both Windows and Unix machines.

  2. The central manager can be either Unix or Windows.

  3. Unix machines can submit jobs to run on other Unix or Windows machines.

  4. Windows machines can submit jobs to run on other Windows or Unix machines.

What versions of Windows will Condor run on?

See Section 1.5.

My Windows program works fine when executed on its own, but it does not work when submitted to Condor.

First, make sure that the program really does work outside of Condor under Windows, that the disk is not full, and that the system is not out of user resources.

Next, be aware that some Windows programs do not run properly because they are dynamically linked and cannot find the .dll files that they depend on. Version 6.4.x of Condor sets the PATH to be empty when running a job. To avoid these difficulties, do one of the following:

  1. statically link the application
  2. wrap the job in a script that sets up the environment
  3. submit the job from a correctly-set environment with the command
    getenv = true
    
    in the submit description file. This will copy your environment into the job's environment.
  4. send the required .dll files along with the job using the submit description file command transfer_input_files.
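
As a rough illustration of option 4, the following Python sketch renders a submit description file that sends DLLs along with the job. The helper name build_submit_description and the file names are hypothetical; the submit commands are the ones named in this section, plus the usual executable and queue lines. Treat it as a sketch under those assumptions, not as the required way to write a submit file.

```python
def build_submit_description(executable, dll_files):
    """Return submit description text that ships the given DLLs with the job."""
    lines = [
        f"executable = {executable}",
        # Turn file transfer on so the listed files are sent to the execute machine.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
    ]
    if dll_files:
        # transfer_input_files takes a comma-separated list of files.
        lines.append("transfer_input_files = " + ", ".join(dll_files))
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_submit_description("myprog.exe", ["helper.dll", "codec.dll"]))
```

The resulting text can be saved to a file and passed to condor_submit.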

Why is the condor_master daemon failing to start, giving an error about "In StartServiceCtrlDispatcher, Error number: 1063"?

In Condor for Windows, the condor_master daemon is started as a service. Therefore, starting the condor_master daemon as you would on Unix will not work. Start Condor on Windows machines using either
	net start condor
or start the Condor service from the Service Control Manager located in the Windows Control Panel.

Jobs submitted from Windows give an error referring to a credential.

Jobs submitted from a Windows machine require a stashed password in order for Condor to perform certain operations on the user's behalf. Refer to section 6.2.3 for information about password storage on Windows. The command which stashes a password for a user is condor_store_cred. See the condor_store_cred manual page for usage details.

The error message that Condor gives if a user has not stashed a password is of the form:

ERROR: No credential stored for username@machinename

        Correct this by running:
	        condor_store_cred add

Jobs submitted from Unix to execute on Windows do not work properly.

A difficulty with defaults causes jobs submitted from Unix for execution on a Windows platform to remain in the queue, but make no progress. For jobs with this problem, log files will contain error messages pointing to shadow exceptions.

This difficulty stems from the defaults for whether file transfer takes place. The workaround for this problem is to place the lines

   should_transfer_files = YES
   when_to_transfer_output = ON_EXIT
into the submit description file for jobs submitted from a Unix machine for execution on a Windows machine.
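
Before submitting from Unix to Windows, it may help to check that a submit description file already contains both lines. The Python sketch below is one way to do that. The function name missing_transfer_settings is hypothetical, and the parser is deliberately simplified: it only understands plain "name = value" lines and "#" comments, and it ignores other submit file features.

```python
REQUIRED = {
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
}


def missing_transfer_settings(submit_text):
    """Return the required commands that are absent or set to another value."""
    found = {}
    for raw in submit_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines such as "queue" with no "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Compare case-insensitively by normalizing both sides.
        found[name.strip().lower()] = value.strip().upper()
    return [name for name, want in REQUIRED.items() if found.get(name) != want]


if __name__ == "__main__":
    sample = "executable = myprog.exe\nshould_transfer_files = YES\nqueue\n"
    print(missing_transfer_settings(sample))
```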

When I run condor_status I get a communication error, or the Condor daemon log files report a failure to bind.

Condor uses the first network interface it sees on your machine. This problem usually means you have an extra, inactive network interface (such as a RAS dial up interface) defined before the regular network interface.

To solve this problem, either change the order of the network interfaces in the Control Panel, or explicitly set which network interface Condor should use by adding the following definition to the Condor configuration file:

NETWORK_INTERFACE = <ip-address>

Here, <ip-address> is the IP address of the interface that Condor is to use.

My job starts but exits right away with status 128.

This can occur when the machine your job is running on is missing a DLL (Dynamic Link Library) required by your program. The solution is to find the DLL file the program needs and add it to the transfer_input_files list in the job's submit description file.

To find out what DLLs your program depends on, right-click the program in Explorer, choose Quickview, and look under "Import List".

How can I access network files with Condor on Windows?

Five methods for making access of network files work with Condor are given in section 6.2.10.

What is wrong when condor_off cannot find my host, and condor_status does not give me a complete host name?

Given the command

  condor_off hostname2
an error message of the form
  Can't find address for master hostname2.somewhere.edu
appears. Yet, when looking at the host names with
  condor_status -master
the output is of the form
  hostname1.somewhere.edu
  hostname2
  hostname3.somewhere.edu

To correct this incomplete host name, add an entry to the configuration file for DEFAULT_DOMAIN_NAME that specifies the domain name to be used. For the example given, the configuration entry will be

  DEFAULT_DOMAIN_NAME = somewhere.edu

After adding this configuration file entry, use condor_restart to restart the Condor daemons and effect the change.
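
On a larger pool, the incomplete names can be picked out with a few lines of Python. The sketch below assumes its input is one host name per line, as in the simplified listing above; real condor_status -master output may carry additional columns that would need to be stripped first. The function names are hypothetical.

```python
def unqualified_hosts(names):
    """Return the host names that have no domain part (no dot in the name)."""
    cleaned = (n.strip() for n in names)
    return [n for n in cleaned if n and "." not in n]


def qualify(name, default_domain):
    """Show the name a host would have once the default domain is appended."""
    return name if "." in name else f"{name}.{default_domain}"


if __name__ == "__main__":
    listing = ["hostname1.somewhere.edu", "hostname2", "hostname3.somewhere.edu"]
    for host in unqualified_hosts(listing):
        print(host, "->", qualify(host, "somewhere.edu"))
```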

Does USER_JOB_WRAPPER work on Windows machines?

The USER_JOB_WRAPPER configuration variable does work on Windows machines. The wrapper must be either a batch script with a file name extension of .bat or .cmd, or an executable with a file name extension of .exe or .com.

An example batch script that sets environment variables and then runs the job:

REM set some environment variables
set LICENSE_SERVER=192.168.1.202:5012
set MY_PARAMS=2

REM Run the actual job now
%*

condor_store_cred is failing, and I'm sure I'm typing my password correctly.

First, make sure the condor_schedd daemon is running.

Next, check the log file written by the condor_schedd daemon. It will contain more detailed information about the failure. Frequently, the failure is caused by PERMISSION DENIED errors. See the security section of this manual for more information about proper configuration of security settings.

My submit machine cannot have more than 120 jobs running concurrently. Why?

Windows is likely to be running out of desktop heap. Confirm this to be the case by looking in the log for the condor_schedd daemon to see if condor_shadow daemons are immediately exiting with status 128. If this is the case, increase the desktop heap size. Open the registry key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows

The SharedSection entry holds three values separated by commas. The third value controls the desktop heap size for non-interactive desktops, which the Condor service uses. The default is 512 (Kbytes). 60 condor_shadow daemons consume about 256 Kbytes, hence 120 shadows can run with the default value. To be able to run a maximum of 300 condor_shadow daemons, set this value to 1280.

Reboot the system for the changes to take effect. For more information, see Microsoft Article Q184802.
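
The arithmetic behind these numbers can be written out as a short Python sketch. It uses only the estimate given above (about 256 Kbytes per 60 condor_shadow daemons), which is approximate. The function names are hypothetical, and the first two SharedSection values in the example are placeholders, not recommendations.

```python
KB_PER_60_SHADOWS = 256  # estimate quoted in the text above


def desktop_heap_kb(max_shadows):
    """Return the heap size in Kbytes needed for max_shadows, rounded up."""
    # Ceiling division using integers only.
    return -(-max_shadows * KB_PER_60_SHADOWS // 60)


def with_third_value(shared_section, heap_kb):
    """Replace the third comma-separated value of a SharedSection setting."""
    parts = shared_section.split(",")
    if len(parts) != 3:
        raise ValueError("expected three comma-separated values")
    parts[2] = str(heap_kb)
    return ",".join(parts)


if __name__ == "__main__":
    needed = desktop_heap_kb(300)
    # "1024,3072,512" is a placeholder for whatever the registry currently holds.
    print(needed, with_third_value("1024,3072,512", needed))
```

Only the third value is changed; the first two are left exactly as found in the registry.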

Why do Condor daemons exit after logging a 10038 (WSAENOTSOCK) error on some machines?

Usually when Condor daemons exit in this manner, it is because the system in question has a non-standard Winsock Layered Service Provider (LSP) installed on it. An LSP is, in effect, a plug-in for the TCP/IP protocol stack. LSPs have been installed as part of anti-virus software and other security-related packages.

There are several tools available to check your system for the presence of LSPs. One with which we have had success is LSP-Fix, available at http://www.cexx.org/lspfix.htm. Any non-Microsoft LSPs identified by this tool may be causing the WSAENOTSOCK error in Condor. Although the LSP-Fix tool allows the direct removal of an LSP, it is usually better to completely remove the application of which the LSP is a part via the Control Panel.

Another approach is to completely reset the TCP/IP stack to its original state. This can be done using the netsh tool:

netsh int ip reset reset-stack.log
The command returns the TCP/IP stack to the state it was in when the OS was first installed. The log file named above records all the configuration changes made by netsh.

Why do Condor daemons exit with "Unexpected performance counter size", "unable to spawn the ProcD" or "loadavg thread died, restarting. (exit code=2)" errors?

Condor on Windows platforms relies on built-in performance counters for its operation. If performance counters that Condor requires are disabled, daemons may exit with a message such as

1/26 09:16:42 (fd:2) (pid:5732) ERROR: "Unexpected performance counter
    size for total CPU: 0 (expected: 8)" at line 2846 in file
    ..\src\condor_procapi\procapi.cpp

or

1/20 15:29:14 (pid:2484) ERROR "unable to spawn the ProcD" at line 136
    in file ..\src\condor_c++_util\proc_family_proxy.C

and even

4/16 10:49:13 loadavg thread died, restarting. (exit code=2)

To enable the performance counters, check the registry key

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PerfProc\Performance
If a value for Disable Performance Counters exists, delete it or set it to 0.
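
The check can also be scripted. The following Python sketch reads the value with the standard winreg module, which exists only on Windows, so the import is kept inside the function that needs it. The function names are hypothetical, and the sketch only reports the state; it does not modify the registry.

```python
PERF_KEY = r"SYSTEM\CurrentControlSet\Services\PerfProc\Performance"
VALUE_NAME = "Disable Performance Counters"


def counters_disabled(value):
    """Interpret the registry value: missing or 0 means counters are enabled."""
    return value is not None and value != 0


def read_disable_value():
    """Return the registry value, or None if the key or value does not exist."""
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PERF_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    if counters_disabled(read_disable_value()):
        print("Performance counters are disabled; delete the value or set it to 0.")
    else:
        print("Performance counters are not disabled by this registry value.")
```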

Why does the Windows Installer fail with "Error 2738. Could not access VBScript run time for custom action"?

This error results when the VBScript engine is not registered. Since Condor's installer depends on the VBScript engine for custom steps, the installer will fail if it cannot find the VBScript engine.

The fix is to register the VBScript engine. With Administrative privilege:

  1. Launch the Command Prompt (cmd.exe) as the Administrator.
  2. At the Command Prompt, change directories to the System32 folder, within the Windows folder.
  3. Issue the command
      regsvr32 vbscript.dll
    

If successful, the message

  DllRegisterServer in vbscript.dll succeeded.
is printed.

Why does Condor sometimes fail to parse floating point numbers?

Condor assumes that all floating point numbers are of the form x.y, which, depending on a computer's current locale, may not always be the case. This problem occurs even if Condor is running under an account that has had the locale configured correctly. The problem lies in the template user account which is used to create Condor's dynamic accounts. Even if the entire system is configured to use a new locale, this template account seems to retain the original system locale. The following steps can be used to fix this problem.

To create a default user profile, you must be logged on as Administrator or be a member of the Administrators group. Create a new user account whose profile all new user accounts on the computer will be based on. To create subsequent profiles, you can use the new user account as a template. The following steps turn the new user's profile into the template for new user profiles:

  1. Log on to the computer as the new user, and customize the desktop if appropriate.
  2. Optionally, install and configure any applications to be shared by user accounts made from this template.
  3. Log off, and then log on as the Administrator.
  4. In the Control Panel, open the System Control Panel applet.
  5. On the Advanced tab, under User Profiles, click Settings.
  6. Under Profiles stored on this computer, select the user you created to be the template, and then click Copy To.
  7. To create the default user profile for the computer, type the path to the default user:
  8. In the Copy To dialog box, under Permitted to use, click Change.
  9. In the Select User or Group dialog box, in the Enter the object name to select text box, type: Everyone and click OK.
  10. Click OK to dismiss the Copy To dialog box.
  11. Click OK again to dismiss the User Profiles dialog box.
  12. Finally, click OK one last time to dismiss the System Properties dialog.

If Condor has already created some dynamic accounts, you will need to remove them so that Condor can re-create them with the new template account.


condor-admin@cs.wisc.edu