⇐ ↙ ↓ ⇑ ⇒ Contents Index10.2 Using HTCondor with the Hadoop File System
The Hadoop project is an Apache project, headquartered at http://hadoop.apache.org, which implements an
open-source, distributed file system across a large set of machines. The file system proper is called the Hadoop File
System, or HDFS, and there are several Hadoop-provided tools which use the file system, most notably databases
and tools which use the map-reduce distributed programming style.
Distributed with the HTCondor source code, HTCondor provides a way to manage the daemons which
implement an HDFS, but no direct support for the high-level tools which run atop this file system. There are two
types of daemons, which together create an instance of a Hadoop File System. The first is called the Name node,
which is like the central manager for a Hadoop cluster. There is only one active Name node per HDFS. If the Name
node is not running, no files can be accessed. The HDFS does not support fail over of the Name node, but it does
support a hot-spare for the Name node, called the Backup node. HTCondor can configure one node to be
running as a Backup node. The second kind of daemon is the Data node, and there is one Data node per
machine in the distributed file system. As these are both implemented in Java, HTCondor cannot
directly manage these daemons. Rather, HTCondor provides a small DaemonCore daemon, called
condor_hdfs, which reads the HTCondor configuration file, responds to HTCondor commands like
condor_on and condor_off , and runs the Hadoop Java code. It translates entries in the HTCondor
configuration file to an XML format native to HDFS. These configuration items are listed with the
condor_hdfs daemon in section 10.2.1. So, to configure HDFS in HTCondor, the HTCondor configuration
file should specify one machine in the pool to be the HDFS Name node, and others to be the Data
nodes.
Once an HDFS is deployed, HTCondor jobs can directly use it in a vanilla universe job, by transferring input
files directly from the HDFS by specifying a URL within the job’s submit description file command
transfer_input_files. See section 3.14.2 for the administrative details to set up transfers specified by a URL. It
requires that a plug-in is accessible and defined to handle hdfs protocol transfers.
10.2.1 condor_hdfs Configuration File Entries
These macros affect the condor_hdfs daemon. Many of these variables determine how the condor_hdfs daemon sets
the HDFS XML configuration.
-
HDFS_HOME
- The directory path for the Hadoop file system installation directory. Defaults to
$(RELEASE_DIR)/libexec. This directory is required to contain
- directory lib, containing all necessary jar files for the execution of a Name node and Data nodes.
- directory conf, containing default Hadoop file system configuration files with names that conform
to *-site.xml.
- directory webapps, containing JavaServer pages (jsp) files for the Hadoop file system’s embedded
server.
-
HDFS_NAMENODE
- The host and port number for the HDFS Name node. There is no default value for this required
variable. Defines the value of fs.default.name in the HDFS XML configuration.
-
HDFS_NAMENODE_WEB
- The IP address and port number for the HDFS embedded web server within the Name node
with the syntax of a.b.c.d:portnumber. There is no default value for this required variable. Defines the value
of dfs.http.address in the HDFS XML configuration.
-
HDFS_DATANODE_WEB
- The IP address and port number for the HDFS embedded web server within the Data node
with the syntax of a.b.c.d:portnumber. The default value for this optional variable is 0.0.0.0:0, which means
bind to the default interface on a dynamic port. Defines the value of dfs.datanode.http.address in the
HDFS XML configuration.
-
HDFS_NAMENODE_DIR
- The path to the directory on a local file system where the Name node will store its meta-data
for file blocks. There is no default value for this variable; it is required to be defined for the Name node
machine. Defines the value of dfs.name.dir in the HDFS XML configuration.
-
HDFS_DATANODE_DIR
- The path to the directory on a local file system where the Data node will store file blocks.
There is no default value for this variable; it is required to be defined for a Data node machine. Defines the
value of dfs.data.dir in the HDFS XML configuration.
-
HDFS_DATANODE_ADDRESS
- The IP address and port number of this machine’s Data node. There is no default value
for this variable; it is required to be defined for a Data node machine, and may be given the value 0.0.0.0:0
as a Data node need not be running on a known port. Defines the value of dfs.datanode.address in the
HDFS XML configuration.
-
HDFS_NODETYPE
- This parameter specifies the type of HDFS service provided by this machine. Possible values are
HDFS_NAMENODE and HDFS_DATANODE. The default value is HDFS_DATANODE.
-
HDFS_BACKUPNODE
- The host address and port number for the HDFS Backup node. There is no default value. It
defines the value of the HDFS dfs.namenode.backup.address field in the HDFS XML configuration
file.
-
HDFS_BACKUPNODE_WEB
- The address and port number for the HDFS embedded web server within the Backup node,
with the syntax of hdfs://<host_address>:<portnumber>. There is no default value for this
required variable. It defines the value of dfs.namenode.backup.http-address in the HDFS XML
configuration.
-
HDFS_NAMENODE_ROLE
- If this machine is selected to be the Name node, then the role must be defined. Possible
values are ACTIVE, BACKUP, CHECKPOINT, and STANDBY. The default value is ACTIVE. The STANDBY value exists
for future expansion. If HDFS_NODETYPE is selected to be Data node (HDFS_DATANODE), then this variable is
ignored.
-
HDFS_LOG4J
- Used to set the configuration for the HDFS debugging level. Currently one of OFF, FATAL, ERROR,
WARN, INFODEBUG, ALL or INFO. Debugging output is written to $(LOG)/hdfs.log. The default value is
INFO.
-
HDFS_ALLOW
- A comma separated list of hosts that are authorized with read and write access to the invoked HDFS.
Note that this configuration variable name is likely to change to HOSTALLOW_HDFS.
-
HDFS_DENY
- A comma separated list of hosts that are denied access to the invoked HDFS. Note that this
configuration variable name is likely to change to HOSTDENY_HDFS.
-
HDFS_NAMENODE_CLASS
- An optional value that specifies the class to invoke. The default value is
org.apache.hadoop.hdfs.server.namenode.NameNode.
-
HDFS_DATANODE_CLASS
- An optional value that specifies the class to invoke. The default value is
org.apache.hadoop.hdfs.server.datanode.DataNode.
-
HDFS_SITE_FILE
- The optional value that specifies the HDFS XML configuration file to generate. The default value
is hdfs-site.xml.
-
HDFS_REPLICATION
- An integer value that facilitates setting the replication factor of an HDFS, defining the value of
dfs.replication in the HDFS XML configuration. This configuration variable is optional, as the HDFS has
its own default value of 3 when not set through configuration.
⇐ ↙ ↑ ⇑ ⇒ Contents Index