This section on network communication in Condor discusses which network ports are used, how Condor behaves on machines with multiple network interfaces and IP addresses, and how to facilitate functionality in a pool that spans firewalls and private networks.
The security section of the manual (section 3.6) contains related information on network communication that is not duplicated here; please see that section as well.
Firewalls, private networks, and network address translation (NAT) pose special problems for Condor. There are currently two main mechanisms for dealing with firewalls within Condor: restricting Condor to a specific range of port numbers that can be opened in the firewall, and using the Condor Connection Broker (CCB).
Each method has its own advantages and disadvantages, as described below.
Every Condor daemon listens on a network port for incoming commands. (Using condor_shared_port, this port may be shared between multiple daemons.) Most daemons listen on a dynamically assigned port. In order to send a message, Condor daemons and tools locate the correct port to use by querying the condor_collector and extracting the port number from the daemon's ClassAd. One of the attributes included in every daemon's ClassAd is the full IP address and port number upon which the daemon is listening.
To access the condor_collector itself, all Condor daemons and tools must know the port number where the condor_collector is listening. The condor_collector is the only daemon with a well-known, fixed port. By default, Condor uses port 9618 for the condor_collector daemon. However, this port number can be changed (see below).
As an optimization for daemons and tools communicating with another daemon that is running on the same host, each Condor daemon can be configured to write its IP address and port number into a well-known file. The file names are controlled using the <SUBSYS>_ADDRESS_FILE configuration variables, as described in section 3.3.5.
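For example, the condor_master's address file might be placed in its log directory (the file name here is only illustrative; see section 3.3.5 for the actual defaults):
MASTER_ADDRESS_FILE = $(LOG)/.master_address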
NOTE: In the 6.6 stable series, and Condor versions earlier than 6.7.5, the condor_negotiator also listened on a fixed, well-known port (the default was 9614). However, beginning with version 6.7.5, the condor_negotiator behaves like all other Condor daemons, and publishes its own ClassAd to the condor_collector which includes the dynamically assigned port the condor_negotiator is listening on. All Condor tools and daemons that need to communicate with the condor_negotiator will either use the NEGOTIATOR_ADDRESS_FILE or will query the condor_collector for the condor_negotiator's ClassAd.
Sites that configure any checkpoint servers will introduce other fixed ports into their network. Each condor_ckpt_server will listen on 4 fixed ports: 5651, 5652, 5653, and 5654. There is currently no way to configure alternative values for any of these ports.
The port number is specified along with the host name in COLLECTOR_HOST. For example, to use port 9650, where the default configuration is
CONDOR_HOST = machX.cs.wisc.edu
COLLECTOR_HOST = $(CONDOR_HOST)
the configuration might be
CONDOR_HOST = machX.cs.wisc.edu
COLLECTOR_HOST = $(CONDOR_HOST):9650
If a non-standard port is defined, the same value of COLLECTOR_HOST (including the port) must be used on every machine in the Condor pool. Therefore, this setting should be modified in the global configuration file (condor_config file), or the value must be duplicated across all configuration files in the pool if a single configuration file is not being shared.
When querying the condor_collector of a remote pool that is running on a non-standard port, any Condor tool that accepts the -pool argument can optionally be given a port number. For example:
% condor_status -pool foo.bar.org:1234
On single machine pools, it is permitted to configure the condor_collector daemon to use a dynamically assigned port, as given out by the operating system. This prevents port conflicts with other services on the same machine. However, a dynamically assigned port is only to be used on single machine Condor pools, and only if the COLLECTOR_ADDRESS_FILE configuration variable has also been defined. This mechanism allows all of the Condor daemons and tools running on the same machine to find the port upon which the condor_collector daemon is listening, even when this port is not defined in the configuration file and is not known in advance.
To enable the condor_collector daemon to use a dynamically assigned port, the port number is set to 0 in the COLLECTOR_HOST variable. The COLLECTOR_ADDRESS_FILE configuration variable must also be defined, as it provides a known file where the IP address and port information will be stored. All Condor clients know to look at the information stored in this file. For example:
COLLECTOR_HOST = $(CONDOR_HOST):0
COLLECTOR_ADDRESS_FILE = $(LOG)/.collector_address
NOTE: Using a port of 0 for the condor_collector and specifying a COLLECTOR_ADDRESS_FILE only works in Condor version 6.6.8 or later in the 6.6 stable series, and in version 6.7.4 or later in the 6.7 development series. Do not attempt to do this with older versions of Condor.
The configuration definition of COLLECTOR_ADDRESS_FILE is in section 3.3.5, and COLLECTOR_HOST is in section 3.3.3.
If a Condor pool is completely behind a firewall, then no special consideration or port usage is needed. However, if there is a firewall between the machines within a Condor pool, then configuration variables may be set to force the usage of specific ports, and to utilize a specific range of ports.
By default, Condor uses port 9618 for the condor_collector daemon, and dynamic (apparently random) ports for everything else. See section 3.7.1 if a dynamically assigned port is desired for the condor_collector daemon.
All of the Condor daemons on a machine may be configured to share a single port. See section 3.3.36 for more information.
The configuration variables HIGHPORT and LOWPORT facilitate setting a restricted range of ports that Condor will use. This may be useful when some machines are behind a firewall. The configuration macros HIGHPORT and LOWPORT will restrict dynamic ports to the range specified. The configuration variables are fully defined in section 3.3.6. All of these ports must be greater than 0 and less than 65,536. Note that both HIGHPORT and LOWPORT must be at least 1024 for Condor version 6.6.8. In general, use ports greater than 1024, in order to avoid port conflicts with standard services on the machine. Another reason for using ports greater than 1024 is that daemons and tools are often not run as root, and only root may listen to a port lower than 1024. Also, the range must include enough ports that are not in use, or Condor cannot work.
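For example, to restrict Condor to ports 9600 through 9700 (values chosen purely for illustration):
LOWPORT = 9600
HIGHPORT = 9700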
The range of ports assigned may be restricted based on incoming (listening) and outgoing (connect) ports with the configuration variables IN_HIGHPORT , IN_LOWPORT , OUT_HIGHPORT , and OUT_LOWPORT . See section 3.3.6 for complete definitions of these configuration variables. A range of ports lower than 1024 for daemons running as root is appropriate for incoming ports, but not for outgoing ports. The use of ports below 1024 (versus above 1024) has security implications; therefore, it is inappropriate to assign a range that crosses the 1024 boundary.
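As an illustrative sketch, a daemon running as root could accept incoming connections on privileged ports while making outgoing connections from unprivileged ports (the specific ranges here are examples only, and neither range crosses the 1024 boundary):
IN_LOWPORT = 500
IN_HIGHPORT = 1023
OUT_LOWPORT = 9600
OUT_HIGHPORT = 9700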
NOTE: Setting HIGHPORT and LOWPORT will not automatically force the condor_collector to bind to a port within the range. The only way to control what port the condor_collector uses is by setting the COLLECTOR_HOST (as described above).
The total number of ports needed depends on the size of the pool, the usage of the machines within the pool (which machines run which daemons), and the number of jobs that may execute at one time. Here we discuss how many ports are used by each participant in the system. This assumes that condor_shared_port is not being used. If it is being used, then all daemons can share a single incoming port.
The central manager of the pool needs 5 + NEGOTIATOR_SOCKET_CACHE_SIZE ports for daemon communication, where NEGOTIATOR_SOCKET_CACHE_SIZE is specified in the configuration or defaults to the value 16.
Each execute machine (those machines running a condor_startd daemon) requires 5 + (5 * number of slots advertised by that machine) ports. By default, the number of slots advertised will equal the number of physical CPUs in that machine.
Submit machines (those machines running a condor_schedd daemon) require 5 + (5 * MAX_JOBS_RUNNING) ports. The configuration variable MAX_JOBS_RUNNING limits (on a per-machine basis, if desired) the maximum number of jobs. Without this configuration macro, the maximum number of jobs that could be simultaneously executing at one time is a function of the number of reachable execute machines.
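As an illustration of these formulas: a central manager using the default NEGOTIATOR_SOCKET_CACHE_SIZE of 16 needs 5 + 16 = 21 ports; an execute machine advertising 4 slots needs 5 + (5 * 4) = 25 ports; and a submit machine with MAX_JOBS_RUNNING set to 200 needs 5 + (5 * 200) = 1005 ports.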
Also be aware that HIGHPORT and LOWPORT only impact dynamic port selection used by the Condor system, and they do not impact port selection used by jobs submitted to Condor. Thus, jobs submitted to Condor that may create network connections may not work in a port restricted environment. For this reason, specifying HIGHPORT and LOWPORT is not going to produce the expected results if a user submits MPI applications to be executed under the parallel universe.
Where desired, a local configuration for machines not behind a firewall can override the usage of HIGHPORT and LOWPORT, such that the ports used for these machines are not restricted. This can be accomplished by adding the following to the local configuration file of those machines not behind a firewall:
HIGHPORT = UNDEFINED
LOWPORT = UNDEFINED
If the range of ports allocated using HIGHPORT and LOWPORT is too small, socket binding errors of the form
failed to bind any port within <$LOWPORT> - <$HIGHPORT>
are likely to appear repeatedly in log files.
The condor_shared_port is an optional daemon responsible for creating a TCP listener port shared by all of the Condor daemons for which the configuration variable USE_SHARED_PORT is True. The condor_master will invoke the condor_shared_port daemon if it is listed in DAEMON_LIST.
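A minimal sketch of this configuration, assuming the existing DAEMON_LIST already lists the other daemons to run on the machine, might be:
USE_SHARED_PORT = True
DAEMON_LIST = $(DAEMON_LIST) SHARED_PORT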
The main purpose of the condor_shared_port daemon is to reduce the number of ports that must be opened when Condor needs to be accessible through a firewall. This has a greater security benefit than simply reducing the number of open ports. Without the condor_shared_port daemon, one can configure Condor to use a range of ports, but since some Condor daemons are created dynamically, this full range of ports will not be in use by Condor at all times. This implies that other non-Condor processes not intended to be exposed to the outside network could unintentionally bind to ports in the range intended for Condor, unless additional steps are taken to control access to those ports. While the condor_shared_port daemon is running, it is exclusively bound to its port, which means that other non-Condor processes cannot accidentally bind to that port.
A secondary benefit of the condor_shared_port daemon is that it helps address the scalability issues of a submit machine. Without the condor_shared_port daemon, approximately 2.1 ephemeral ports per running job are required, and possibly more, depending on the rate of job completion. There are only 64K ports in total, and most standard Unix installations only allocate a subset of these as ephemeral ports. In practice, with long-running jobs, and with between 11K and 14K simultaneously running jobs, port exhaustion has been observed in typical Linux installations. After increasing the ephemeral port range to as many ports as possible, port exhaustion still occurred between 20K and 25K running jobs. Using the condor_shared_port daemon, each running job requires approximately 1.1 ephemeral ports on the submit node, if Condor on the submit node connects directly to Condor on the execute node. If the submit node connects via CCB to the execute node, no ports are required per running job; only the one port allocated to the condor_shared_port daemon is used.
When CCB is utilized via setting the configuration variable CCB_ADDRESS , the condor_shared_port daemon registers with the CCB server on behalf of all daemons sharing the port. This means that it is not possible to individually enable or disable CCB connectivity to daemons that are using the shared port; they all effectively share the same setting, and the condor_shared_port daemon handles all CCB connection requests on their behalf.
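For example, a site might designate the condor_collector machine as its CCB server; this sketch assumes that choice is appropriate for the site:
CCB_ADDRESS = $(COLLECTOR_HOST)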
Condor's authentication and authorization steps are unchanged by the use of a shared port. Each Condor daemon continues to operate according to its configured policy. Requests for connections to the shared port are not authenticated or restricted by the condor_shared_port daemon. They are simply passed to the requested daemon, which is then responsible for enforcing the security policy.
When the condor_master is configured to use the shared port by setting the configuration variable
USE_SHARED_PORT = True
the condor_shared_port daemon is treated specially. A command such as condor_off, which shuts down all daemons except for the condor_master, will also leave the condor_shared_port daemon running. This prevents the condor_master from getting into a state where it can no longer receive commands.
The condor_collector daemon typically has its own port; it uses 9618 by default. However, it can be configured to use a shared port. Since the address of the condor_collector must be set in the configuration file, it is necessary to specify the shared port socket name of the condor_collector, so that connections to the shared port that are intended for the condor_collector can be forwarded to it. If the shared port number is 11000, a condor_collector address using this shared port could be configured:
COLLECTOR_HOST = collector.host.name:11000?sock=collector
This configuration assumes that the socket name used by the condor_collector is collector. The condor_collector that runs on collector.host.name will automatically choose this socket name if COLLECTOR_HOST is configured as in the example above. If multiple condor_collector daemons are started on the same machine, the socket name can be explicitly set in the daemon arguments, as in the example:
COLLECTOR_ARGS = -sock collector
When the condor_collector address is a shared port, TCP updates will be automatically used instead of UDP. Under Unix, this means that the condor_collector daemon should be configured to have enough file descriptors. See section 3.7.4 for more information on using TCP within Condor.
SOAP commands cannot be sent over a shared port. However, a daemon may be configured to open a fixed, non-shared port in addition to using a shared port. This is done both by setting USE_SHARED_PORT = True and by specifying a fixed port for the daemon using <SUBSYS>_ARGS = -p <portnum>.
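As an illustrative sketch, the following gives the condor_schedd a fixed port of 9620 (a port number chosen only for this example) in addition to the shared port:
USE_SHARED_PORT = True
SCHEDD_ARGS = -p 9620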
The TCP connections required to manage standard universe jobs do not make use of shared ports.
Condor can run on machines with multiple network interfaces. Starting with Condor version 6.7.13 (and therefore all Condor 6.8 and more recent versions), new functionality is available that allows even better support for multi-homed machines, using the configuration variable BIND_ALL_INTERFACES . A multi-homed machine is one that has more than one NIC (Network Interface Card). Further improvements to this new functionality will remove the need for any special configuration in the common case. For now, care must still be given to machines with multiple NICs, even when using this new configuration variable.
Machines can be configured such that whenever Condor daemons or tools call bind(), the daemons or tools use all network interfaces on the machine. This means that outbound connections will always use the appropriate network interface to connect to a remote host, instead of being forced to use an interface that might not have a route to the given destination. Furthermore, sockets upon which a daemon listens for incoming connections will be bound to all network interfaces on the machine. This means that so long as remote clients know the right port, they can use any IP address on the machine and still contact a given Condor daemon.
This functionality is on by default. To disable this functionality, the boolean configuration variable BIND_ALL_INTERFACES is defined and set to False:
BIND_ALL_INTERFACES = FALSE
This functionality has limitations, described below.
Currently, Condor daemons can only advertise two IP addresses in the ClassAd they send to their condor_collector. One is the public IP address and the other is the private IP address. Condor tools and other daemons that wish to connect to the daemon will use the private IP address if they are configured with the same private network name, and they will use the public IP address otherwise. So, even if the daemon is listening on 3 or more different interfaces, each with a separate IP, the daemon must choose which two IP addresses to advertise so that other daemons and tools can connect to it.
By default, Condor advertises the IP address of the network interface used to contact the condor_collector as its public address, since this is the most likely to be accessible to other processes that query the same condor_collector. The NETWORK_INTERFACE configuration variable can be used to specify the public IP address Condor should advertise, and PRIVATE_NETWORK_INTERFACE , along with PRIVATE_NETWORK_NAME can be used to specify the private IP address to advertise.
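For example, a multi-homed machine might advertise a public address of 123.123.123.123 and a private address of 10.0.0.1 (all of the values below are placeholders, not recommendations):
NETWORK_INTERFACE = 123.123.123.123
PRIVATE_NETWORK_INTERFACE = 10.0.0.1
PRIVATE_NETWORK_NAME = priv.example.org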
Sites that make heavy use of private networks and multi-homed machines should consider whether using the Condor Connection Broker, CCB, is right for them. More information about CCB can be found in the section of this manual that describes CCB.
Often users of Condor wish to set up compute farms with one machine that has two network interface cards (one for the public Internet, and one for the private network). In most cases it is convenient to set up the head node as the central manager, and the instructions for doing so follow.
Setting up the central manager on a machine with more than one NIC can be a little confusing, because a few external factors can make the process difficult. The most common mistakes are that one of the interfaces is not active, or that the host/domain names associated with the interfaces are incorrectly configured.
Given that the interfaces are up and functioning, and that they have good host/domain names associated with them, here is how to configure Condor:
In this example, farm-server.farm.org maps to the private interface. In the central manager's global (to the cluster) configuration file:
CONDOR_HOST = farm-server.farm.org
In the central manager's local configuration file:
NETWORK_INTERFACE = <IP address of farm-server.farm.org>
NEGOTIATOR = $(SBIN)/condor_negotiator
COLLECTOR = $(SBIN)/condor_collector
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
If the central manager and farm machines are all running Windows NT, then only the vanilla universe will work. However, if this is set up for Unix, then at this point standard universe jobs should be able to function in the pool. If UID_DOMAIN is not configured to be homogeneous across the farm machines, the standard universe jobs will run as user nobody on the farm machines.
In order to get vanilla universe jobs and file server load balancing for standard universe jobs working (under Unix), some more work is needed, both in the cluster itself and in the Condor configuration. First, a file server (which could also be the central manager) is needed to serve files to all of the farm machines. This could be NFS or AFS; it does not really matter to Condor. The mount point of the directories that users are to use must be the same across all of the farm machines. Next, configure UID_DOMAIN and FILESYSTEM_DOMAIN to be homogeneous across the farm machines and the central manager. Inform Condor that an NFS or AFS file system exists by adding the following to the global (to the farm) configuration file:
# If you have NFS
USE_NFS = True
# If you have AFS
HAS_AFS = True
USE_AFS = True
# If you want both NFS and AFS, then enable both sets above
Now, if the cluster is set up such that it is possible for a machine name to not have a domain name (for example, there is a machine name but no fully qualified domain name in /etc/hosts), configure DEFAULT_DOMAIN_NAME to be the domain that is to be appended to the host name.
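For example, matching the farm described above (the domain is taken from that example):
DEFAULT_DOMAIN_NAME = farm.org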
If a client machine has two or more NICs, then there might be a specific network interface on which the client machine should communicate with the rest of the Condor pool. In this case, the local configuration file for the client should have
NETWORK_INTERFACE = <IP address of desired interface>
If a checkpoint server is on a machine with multiple interfaces, then two items must be correct for things to work:
TCP sockets are reliable, connection-based sockets that guarantee the delivery of any data sent. However, TCP sockets are fairly expensive to establish, and there is more network overhead involved in sending and receiving messages.
UDP sockets are datagrams, and are not reliable. There is very little overhead in establishing or using a UDP socket, but there is also no guarantee that the data will be delivered. Typically, the lack of guaranteed delivery for UDP does not cause problems for Condor.
Condor can be configured to use TCP sockets to send updates to the condor_collector instead of UDP datagrams. This feature is intended for sites where UDP updates are lost because of the underlying network. An example where this may happen is if the pool is comprised of machines across a wide area network (WAN) where UDP packets are observed to be frequently dropped.
To enable the use of TCP sockets, the following configuration setting is used:
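UPDATE_COLLECTOR_WITH_TCP = True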
When there are sufficient file descriptors, the condor_collector leaves established TCP sockets open, facilitating better performance. Subsequent updates can reuse an already open socket.
Each Condor daemon will have 1 socket open to the condor_collector. So, in a pool with N machines, each of them running a condor_master, condor_schedd, and condor_startd, the condor_collector would need at least 3*N file descriptors. If the condor_collector is also acting as a CCB server, it will require an additional file descriptor for each registered daemon. In typical Unix installations, the default number of file descriptors available to the condor_collector is only 1024. This can be modified with a configuration setting such as the following:
COLLECTOR_MAX_FILE_DESCRIPTORS = 1600
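As an illustration, a pool of 500 machines, each running a condor_master, a condor_schedd, and a condor_startd, would require at least 3 * 500 = 1500 file descriptors on the condor_collector, so a limit such as the one shown above would suffice.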
If there are not sufficient file descriptors for all of the daemons sending updates to the condor_collector, a warning will be printed in the condor_collector log file. Look for the string file descriptor safety level exceeded.
NOTE: At this time, UPDATE_COLLECTOR_WITH_TCP only affects the main condor_collector for the site, not any sites that a condor_schedd might flock to.