Sunday, January 4, 2015

Setting up Hadoop in pseudo-distributed mode and integrating it with WSO2 BAM

One of the most fundamental tasks of WSO2 BAM is data analysis. WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework, which uses the highly scalable MapReduce technology underneath.

By default, BAM ships with an embedded Hadoop instance (running in local mode) which starts and stops with BAM. But when running in a production environment, it is recommended to configure BAM with an external multi-node Hadoop cluster, which is highly available and scalable. Read here for more information on setting up BAM with a multi-node Hadoop cluster.

Other than the embedded mode and the fully distributed (multi-node) mode, Hadoop can also run in pseudo-distributed mode, where each Hadoop daemon (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker) runs in a separate Java process.

In this post, I am going to discuss how to set up Hadoop in pseudo-distributed mode and integrate it with WSO2 BAM. This will be useful if you want to simulate a BAM + Hadoop cluster on a small scale, or if you need to connect BAM to an external Hadoop cluster but have only one server available.

Installing & Configuring Hadoop in Pseudo Distributed Mode


1. Install Java in a location that all the user groups can access

The Java location used in this example is /usr/local/java/jdk1.6.0_38.
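
To quickly confirm that the installation is accessible from that location (assuming the example path above), you can run:
    $ /usr/local/java/jdk1.6.0_38/bin/java -version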

2. Create a dedicated system user

- Create a group named 'hadoop'.
    $ sudo addgroup hadoop

- Add a new user 'hduser' to the group 'hadoop'.
    $ sudo adduser --ingroup hadoop hduser

3. Configuring SSH to localhost

- Install OpenSSH if not installed already
    $ sudo apt-get install openssh-server

- Login as hduser
    $ su - hduser

- Generate a passphrase-less SSH key pair for Hadoop.
    $ ssh-keygen -t rsa -P ""

- Append this public key to the authorized_keys file.
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

- Verify the passwordless SSH configuration using the following command:
    $ ssh localhost

4. Installing Hadoop

- Exit from hduser if you are logged in as hduser.
    $ exit

- Download Hadoop 1.2.1 distribution from here.

- Extract Hadoop distribution
    $ tar -zxvf hadoop-1.2.1.tar.gz

- Move it to a location that all users can access
    $ sudo mv hadoop-1.2.1 /usr/local/

- Give ownership of it to user 'hduser'
    $ cd /usr/local
    $ sudo chown -R hduser:hadoop hadoop-1.2.1

5. Setting JAVA_HOME/PATH variables for user 'hduser'

- Login as hduser
    $ su - hduser

- Add the following lines to the end of the $HOME/.bashrc file of user 'hduser'
    export JAVA_HOME=/usr/local/java/jdk1.6.0_38
    export PATH=$PATH:$JAVA_HOME/bin
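
- Reload the file so the changes take effect in the current shell, and verify that Java is picked up:
    $ source $HOME/.bashrc
    $ java -version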

6. Configuring Hadoop

- Define JAVA_HOME in <HADOOP_HOME>/conf/hadoop-env.sh file:
    export JAVA_HOME=/usr/local/java/jdk1.6.0_38

- Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>fs.hdfs.impl</name>
        <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
        <description>The FileSystem for hdfs: uris.</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/hdfstmp</value>
    </property>
</configuration>

- Create the hdfstmp directory in the home directory of user 'hduser'
    $ mkdir hdfstmp

- Edit the <HADOOP_HOME>/conf/hdfs-site.xml as follows:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop-1.2.1/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop-1.2.1/dfs/data</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>

- Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/usr/local/hadoop-1.2.1/mapred/system</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/usr/local/hadoop-1.2.1/mapred/local</value>
    </property>
</configuration>

7. Formatting HDFS Namenode

- Login as hduser if you haven't already logged in.
    $ su - hduser

- From the Hadoop installation directory, execute the following command to format the namenode:
    $ bin/hadoop namenode -format

Note - This formats the HDFS namenode file system at the first run. You'll have to do this ONLY ONCE.
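
If the format succeeds, the name directory configured in hdfs-site.xml should now contain the file system image. A quick check (the path assumes the configuration used above):
    $ ls /usr/local/hadoop-1.2.1/dfs/name/current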

8. Starting and Stopping Hadoop

- Start the Hadoop cluster from the <HADOOP_HOME>/bin directory using the command: 

    $ sh start-all.sh

Note: The above command starts all Hadoop daemons. Check the Hadoop daemon logs in the <HADOOP_HOME>/logs directory to verify that everything is working properly.
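
You can also verify that all five daemons are running using the jps tool that ships with the JDK; its output should include NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker entries.
    $ jps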

- To stop the Hadoop cluster, execute the following command from the <HADOOP_HOME>/bin directory:

    $ sh stop-all.sh

9. Hadoop Web Console

    http://localhost:50070/ – Web UI of the NameNode daemon
    http://localhost:50030/ – Web UI of the JobTracker daemon
    http://localhost:50060/ – Web UI of the TaskTracker daemon
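
Before pointing BAM at this cluster, you can optionally run a quick HDFS smoke test as user 'hduser' from the Hadoop installation directory (a sketch; the /smoke-test path is just an example):
    $ bin/hadoop fs -mkdir /smoke-test
    $ bin/hadoop fs -ls /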


Configuring WSO2 BAM with External Hadoop


1. Download the WSO2 BAM distribution from here and unzip it.

2. Modify the WSO2BAM_DATASOURCE in the <BAM_HOME>/repository/conf/datasources/bam-datasources.xml file. WSO2BAM_DATASOURCE is the default data source available in BAM, and it should be configured to connect to the summary RDBMS database you are using. Be sure to change the database URL and credentials according to your environment. In this example, I am using a MySQL database named BAM_STATS_DB to store BAM summary data (see the note after the configuration below if you still need to create it).

<datasource>
    <name>WSO2BAM_DATASOURCE</name>
    <description>The datasource used for analyzer data</description>
    <definition type="RDBMS">
        <configuration>
            <url>jdbc:mysql://127.0.0.1:3306/BAM_STATS_DB?autoReconnect=true&amp;relaxAutoCommit=true</url>
            <username>root</username>
            <password>root</password>
            <driverClassName>com.mysql.jdbc.Driver</driverClassName>
            <maxActive>50</maxActive>
            <maxWait>60000</maxWait>
            <testOnBorrow>true</testOnBorrow>
            <validationQuery>SELECT 1</validationQuery>
            <validationInterval>30000</validationInterval>
        </configuration>
    </definition>
</datasource>
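
If the BAM_STATS_DB database does not exist yet, create it in MySQL before starting BAM (a minimal sketch, assuming a local MySQL server and the root credentials used above):
    $ mysql -u root -p -e "CREATE DATABASE BAM_STATS_DB;"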

3. Add the MySQL connector JAR (mysql-connector-java-5.1.28-bin.jar) to the <BAM_HOME>/repository/components/lib directory.
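
For example, if the connector JAR has been downloaded to the current directory (the source path here is only illustrative):
    $ cp mysql-connector-java-5.1.28-bin.jar <BAM_HOME>/repository/components/lib/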

4. Modify <BAM_HOME>/repository/conf/advanced/hive-site.xml as follows. Note that the MySQL connector JAR file name has been appended to the hive.aux.jars.path property so that it is available in the Hadoop job execution runtime.

<property>
  <name>hadoop.embedded.local.mode</name>
  <value>false</value>
</property>
 
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
 
<property>    
  <name>hive.aux.jars.path</name>
  <value>file://${CARBON_HOME}/repository/components/plugins/apache-cassandra_1.2.13.wso2v4.jar,file://${CARBON_HOME}/repository/components/plugins/guava_12.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/json_2.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-dbcp_1.4.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-pool_1.5.6.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/libthrift_0.7.0.wso2v2.jar,file://${CARBON_HOME}/repository/components/plugins/hector-core_1.1.4.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/org.wso2.carbon.bam.cassandra.data.archive_4.2.2.jar,file://${CARBON_HOME}/repository/components/lib/mysql-connector-java-5.1.28-bin.jar</value>
</property>

5. Everything is configured now. Try out one of the samples to verify the BAM functionality. Check the Hadoop logs (<HADOOP_HOME>/logs) if there are errors related to the analytics part. The Hadoop web console is also useful for troubleshooting.
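
Once a sample's Hive scripts have executed, you can confirm that summary data has reached the RDBMS by listing the tables in the summary database (table names depend on the sample you ran):
    $ mysql -u root -p BAM_STATS_DB -e "SHOW TABLES;"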
