Sunday, January 4, 2015

Setting up Hadoop in pseudo-distributed mode and integrating it with WSO2 BAM

One of the most fundamental tasks of WSO2 BAM is data analysis. WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework, which uses the highly scalable MapReduce technology underneath.

By default, BAM ships with an embedded Hadoop instance (running in local mode) which starts and stops with BAM. But when running in a production environment, it is recommended to configure BAM with an external multi-node Hadoop cluster that is highly available and scalable. Read here for more information on setting up BAM with a multi-node Hadoop cluster.

Other than the embedded mode and the fully distributed (multi-node) mode, Hadoop can also run in pseudo-distributed mode, where each Hadoop daemon (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker) runs in a separate Java process.

In this post, I am going to discuss how to set up Hadoop in pseudo-distributed mode and integrate it with WSO2 BAM. This will be useful if you want to simulate a BAM + Hadoop cluster on a small scale, or if you need to connect BAM to an external Hadoop cluster but have only one server available.

Installing & Configuring Hadoop in Pseudo-Distributed Mode


1. Install Java in a location that all user groups can access

The Java location used in this example is /usr/local/java/jdk1.6.0_38.

2. Create a dedicated system user

- Create a group named 'hadoop'
    $ sudo addgroup hadoop

- Add a new user 'hduser' to the 'hadoop' group
    $ sudo adduser --ingroup hadoop hduser

3. Configuring SSH to localhost

- Install OpenSSH if not installed already
    $ sudo apt-get install openssh-server

- Login as hduser
    $ su - hduser

- Generate a passphrase-less SSH key pair for Hadoop.
    $ ssh-keygen -t rsa -P ""

- Append this public key to the authorized_keys file.
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

- Verify the passwordless SSH configuration using the command.
    $ ssh localhost
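One common pitfall: sshd (with its default strict modes) refuses key-based login when the .ssh directory or the authorized_keys file is too permissive, and silently falls back to asking for a password. A quick fix, as hduser:

```shell
# Tighten permissions so sshd accepts the key (strict modes are the default)
chmod 700 "$HOME/.ssh"
chmod 600 "$HOME/.ssh/authorized_keys"
```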

4. Installing Hadoop

- Exit from hduser if you have logged in as hduser.
    $ exit

- Download Hadoop 1.2.1 distribution from here.

- Extract Hadoop distribution
    $ tar -zxvf hadoop-1.2.1.tar.gz

- Move it to a location where all users can access it
    $ sudo mv hadoop-1.2.1 /usr/local/

- Give ownership to user 'hduser'
    $ cd /usr/local
    $ sudo chown -R hduser:hadoop hadoop-1.2.1

5. Setting JAVA_HOME/PATH variables for user 'hduser'

- Login as hduser
    $ su - hduser

- Add the following lines to the end of the $HOME/.bashrc file of user 'hduser'
    export JAVA_HOME=/usr/local/java/jdk1.6.0_38
    export PATH=$PATH:$JAVA_HOME/bin
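Optionally (Hadoop itself does not require it), you can also export HADOOP_HOME and put the Hadoop scripts on the PATH; the path below assumes the /usr/local install location used in step 4:

```shell
# Convenience variables (path assumes the /usr/local install from step 4)
export HADOOP_HOME=/usr/local/hadoop-1.2.1
export PATH=$PATH:$HADOOP_HOME/bin
```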

6. Configuring Hadoop

- Define JAVA_HOME in <HADOOP_HOME>/conf/hadoop-env.sh file:
    export JAVA_HOME=/usr/local/java/jdk1.6.0_38

- Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/hdfstmp</value>
  </property>
</configuration>

- Create the hdfstmp directory in the hduser home directory
    $ mkdir hdfstmp

- Edit the <HADOOP_HOME>/conf/hdfs-site.xml as follows:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop-1.2.1/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop-1.2.1/dfs/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

- Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/usr/local/hadoop-1.2.1/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/usr/local/hadoop-1.2.1/mapred/local</value>
  </property>
</configuration>
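The dfs.* and mapred.* paths above point inside the Hadoop install directory. Hadoop creates them on first start, but you can pre-create them to confirm that hduser has write access (the path below assumes the install location from step 4):

```shell
# Pre-create the local directories referenced in the config files above
HADOOP_DIR=/usr/local/hadoop-1.2.1   # adjust if installed elsewhere
mkdir -p "$HADOOP_DIR/dfs" "$HADOOP_DIR/mapred"
```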

7. Formatting HDFS Namenode

- Login as hduser if you haven't already logged in.
    $ su - hduser

- From the Hadoop installation directory, execute the following command to format the namenode:
    $ bin/hadoop namenode -format

Note - This formats the HDFS namenode file system at the first run. You'll have to do this ONLY ONCE.

8. Starting and Stopping Hadoop

- Start the Hadoop cluster from the <HADOOP_HOME>/bin directory using the command: 

    $ sh start-all.sh

Note: The above command starts all Hadoop daemons. Check the Hadoop daemon logs in the <HADOOP_HOME>/logs directory to verify that everything is working.
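A quick way to confirm all five daemons are up is jps, which ships with the JDK; this is a sketch, and the process names match the daemon list from the introduction:

```shell
# List running Hadoop JVMs; expect all five daemons after start-all.sh
jps | egrep 'NameNode|SecondaryNameNode|DataNode|JobTracker|TaskTracker'
```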

- To stop the Hadoop cluster, execute the following command from the <HADOOP_HOME>/bin directory:

    $ sh stop-all.sh

9. Hadoop Web Console


    http://localhost:50070/ – Web UI of the NameNode daemon
    http://localhost:50030/ – Web UI of the JobTracker daemon
    http://localhost:50060/ – Web UI of the TaskTracker daemon
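If you prefer the command line, the same endpoints can be probed with curl (the ports are the Hadoop 1.x defaults):

```shell
# Probe the three web UIs; prints the HTTP status, or a warning per port
for port in 50070 50030 50060; do
  curl -s -o /dev/null -m 5 -w "port $port -> HTTP %{http_code}\n" \
    "http://localhost:$port/" || echo "port $port not reachable"
done
```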


Configuring WSO2 BAM with External Hadoop


1. Download the WSO2 BAM distribution from here and unzip it.

2. Modify the WSO2BAM_DATASOURCE in the <BAM_HOME>/repository/conf/datasources/bam-datasources.xml file. WSO2BAM_DATASOURCE is the default data source available in BAM, and it should be configured to connect to the summary RDBMS database you are using. Be sure to change the database URL and credentials according to your environment. In this example, I am using a MySQL database named BAM_STATS_DB to store BAM summary data.

<datasource>
    <name>WSO2BAM_DATASOURCE</name>
    <description>The datasource used for analyzer data</description>
    <definition type="RDBMS">
        <configuration>
            <url>jdbc:mysql://127.0.0.1:3306/BAM_STATS_DB?autoReconnect=true&amp;relaxAutoCommit=true</url>
            <username>root</username>
            <password>root</password>
            <driverClassName>com.mysql.jdbc.Driver</driverClassName>
            <maxActive>50</maxActive>
            <maxWait>60000</maxWait>
            <testOnBorrow>true</testOnBorrow>
            <validationQuery>SELECT 1</validationQuery>
            <validationInterval>30000</validationInterval>
        </configuration>
    </definition>
</datasource>

3. Add the MySQL connector JAR (mysql-connector-java-5.1.28-bin.jar) to the <BAM_HOME>/repository/components/lib directory.

4. Modify <BAM_HOME>/repository/conf/advanced/hive-site.xml as follows. Note that the MySQL connector JAR has been appended to the hive.aux.jars.path property so that it is included in the Hadoop job execution runtime.

<property>
  <name>hadoop.embedded.local.mode</name>
  <value>false</value>
</property>
 
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
 
<property>    
  <name>hive.aux.jars.path</name>
  <value>file://${CARBON_HOME}/repository/components/plugins/apache-cassandra_1.2.13.wso2v4.jar,file://${CARBON_HOME}/repository/components/plugins/guava_12.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/json_2.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-dbcp_1.4.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-pool_1.5.6.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/libthrift_0.7.0.wso2v2.jar,file://${CARBON_HOME}/repository/components/plugins/hector-core_1.1.4.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/org.wso2.carbon.bam.cassandra.data.archive_4.2.2.jar,file://${CARBON_HOME}/repository/components/lib/mysql-connector-java-5.1.28-bin.jar</value>
</property>

5. Everything is configured now. Try out one of the samples to verify the BAM functionality. If there are errors related to the analytics part, check the Hadoop logs (<HADOOP_HOME>/logs). The Hadoop web console is also useful.

Friday, January 2, 2015

How to package a Dashboard created using WSO2 BAM Gadget Generation Tool in a BAM Toolbox

WSO2 BAM comes with a gadget generation tool which has a step by step wizard that allows you to generate gadgets for your dashboard in a few simple steps. Read here for more information.

In this post, I am going to discuss how to package a dashboard created using the Gadget Gen Tool in a BAM Toolbox. This is important because otherwise it will be very difficult to add the created dashboard to a different BAM server instance. If you have a Toolbox created, it is just a matter of installing the Toolbox.

Let's assume you have created a dashboard which consists of two gadgets named brand_vs_qty and sales_summary. In my previous post I discussed how to create such a dashboard, which looks like the one below.



Follow the steps below to package the above dashboard in a Toolbox.


1. First, create empty folders and files in the following structure.

BAM_Toolbox <folder>
└── dashboard <folder>
    ├── dashboard.properties <file>
    ├── gadgets <folder>
    └── jaggery <folder>
        └── gadgetgen <folder>


Note: Refer to the structure of the sample HTTPD_Logs.tbox and KPI_Phone_Retail_Store.tbox toolboxes that are packed with BAM by default (found at <BAM_HOME>/samples/toolboxes).
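The skeleton above can be created with a few shell commands (the folder names are taken from the tree in step 1):

```shell
# Create the Toolbox skeleton from step 1
mkdir -p BAM_Toolbox/dashboard/gadgets
mkdir -p BAM_Toolbox/dashboard/jaggery/gadgetgen
touch BAM_Toolbox/dashboard/dashboard.properties
```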

2. Copy the two Jaggery files (brand_vs_qty.jag, sales_summary.jag) which were generated above from <BAM_HOME>/repository/deployment/server/jaggeryapps/gadgetgen to BAM_Toolbox/dashboard/jaggery/gadgetgen

3. Modify the above Jaggery files (brand_vs_qty.jag, sales_summary.jag) to use the datasource name of the BAM summary DB. By default, the Gadget Gen Tool adds the DB credentials provided in the Gadget Gen Wizard, so it is important to change them to use the datasource name; otherwise the DB credentials would have to be changed manually.

        try {
            db = new Database("WSO2BAM_DATASOURCE");
            result = db.query("select * from brandSummary");
        } finally {
            if (db != null) {
                db.close();
            }
        }

4. Download the gadget XMLs brand_vs_qty.xml and sales_summary.xml from the WSO2 BAM Configuration Registry.
Path - /_system/config/repository/components/org.wso2.carbon.bam.gadgetgen/gadgetgen



To download the .xml files, click on each of them and then click the 'Download' link



5. Copy the downloaded files (brand_vs_qty.xml and sales_summary.xml) to the BAM_Toolbox/dashboard/gadgets/ folder

6. Open the brand_vs_qty.xml file using a text editor and modify the 'url' value in the update() function as follows. By default, the Gadget Gen Tool adds the localhost IP and port to the URL, so it needs to be modified to use a relative path in order to support both super-tenant and tenant modes.

            function update() {
                var respJson = null;
                var finalURL = "";
                if (window.location.pathname.search("/t/") == 0) {
                    finalURL = "./jaggeryapps/gadgetgen/brand_vs_qty.jag";
                } else {
                    finalURL = "../../gadgetgen/brand_vs_qty.jag";
                }
                $.ajax({
                    url: finalURL,
                    
                    dataType: 'json',

7. Open the sales_summary.xml file using a text editor and modify the 'url' value in its update() function similarly.

            function update() {
                var respJson = null;
                var finalURL = "";
                if (window.location.pathname.search("/t/") == 0) {
                    finalURL = "./jaggeryapps/gadgetgen/sales_summary.jag";
                } else {
                    finalURL = "../../gadgetgen/sales_summary.jag";
                }
                $.ajax({
                    url: finalURL,

                    dataType: 'json',

8. We need to add the required gadget-related files (css, js, images, etc.) to our Toolbox. Extract the existing KPI_Phone_Retail_Store.tbox (or HTTPD_Logs.tbox) and copy the three folders (css, images, js) found at KPI_Phone_Retail_Store/dashboard/gadgets to BAM_Toolbox/dashboard/gadgets/

9. Open the dashboard.properties file and add the following properties.

    dashboard.tabs=tab1
    dashboard.tabs.tab1.name=Phone Brands
    dashboard.tabs.tab1.gadgets=brand_vs_qty.xml,sales_summary.xml

10. Zip the BAM_Toolbox folder and rename the extension to .tbox, so our Toolbox will be named BAM_Toolbox.tbox.

11. Now we have successfully packaged the created dashboard in a Toolbox. To make sure it works properly, install the Toolbox in a fresh BAM pack and check the dashboard.

Thursday, January 1, 2015

How to use WSO2 BAM Gadget Generation Tool to Create a Dashboard

In this post, I am going to discuss how to create a simple dashboard using the WSO2 BAM gadget generation tool and package that dashboard in a BAM Toolbox.

Publishing Data

First, we should publish some data to BAM. Here, I am going to use the KPI sample to publish data. Open a command prompt (or a shell in Linux) and go to the <BAM_HOME>/samples/kpi-definition directory. Then run the ant command to publish some sample data.

Analyzing Data

Now run the Hive script below to summarize the published data. Instructions on adding and executing a Hive script can be found here.

CREATE EXTERNAL TABLE IF NOT EXISTS PhoneSalesTable
    (orderID STRING, brandName STRING, userName STRING, quantity INT, version STRING)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    WITH SERDEPROPERTIES (
        "wso2.carbon.datasource.name" = "WSO2BAM_CASSANDRA_DATASOURCE",
        "cassandra.cf.name" = "org_wso2_bam_phone_retail_store_kpi",
        "cassandra.columns.mapping" = ":key,payload_brand, payload_user, payload_quantity, Version" );

CREATE EXTERNAL TABLE IF NOT EXISTS PhonebrandTable
    (brand STRING, totalOrders INT, totalQuantity INT)
    STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
    TBLPROPERTIES (
        'wso2.carbon.datasource.name' = 'WSO2BAM_DATASOURCE',
        'hive.jdbc.update.on.duplicate' = 'true',
        'hive.jdbc.primary.key.fields' = 'brand',
        'hive.jdbc.table.create.query' =
            'CREATE TABLE brandSummary (brand VARCHAR(100) NOT NULL PRIMARY KEY, totalOrders INT, totalQuantity INT)' );

insert overwrite table PhonebrandTable
    select brandName, count(DISTINCT orderID), sum(quantity)
    from PhoneSalesTable where version = "1.0.0" group by brandName;

Note: Read here for more information about the BAM analytics framework.

Creating Gadgets

As you can see in the above Hive script, the summarized data is stored in the table named 'brandSummary' in the summary database (RDBMS). Since H2 is the default summary database packed with WSO2 BAM, the 'brandSummary' table is created in the H2 database.

Let's generate a couple of BAM dashboard gadgets according to the instructions given here.

1. Go to Tools → Gadget Gen Tool in the BAM Management Console. Then enter the values given below, which correspond to the data source 'WSO2BAM_DATASOURCE'.

    JDBC URL : jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
    Driver Class Name : org.h2.Driver
    User Name : wso2carbon
    Password : wso2carbon


2. In the next step, add the following SQL script.

    select * from brandSummary


3. Enter as follows in the next step.

    Pick UI Element : Bar Graph
    Chart Title : Phone Brand Vs Quantity
    Y-Axis Label : Quantity
    Y-Axis Column : TOTALQUANTITY
    X-Axis Label : Phone Brand
    X-Axis Column : BRAND


4. In the next step, enter as follows.

    Gadget Title : Phone Brand Vs Quantity
    Gadget File Name : brand_vs_qty
    Refresh Rate (in Seconds) : 60


Then click the Generate button and follow the steps mentioned in the 'Add gadget to dashboard' section of the doc here to add the gadget to the dashboard.

5. Create another gadget by selecting the 'Table' UI element (in Step 3 above) as follows. Use the same details given in Step 1 and Step 2.


Generate the gadget and add it to the same tab of the dashboard. Your final dashboard will look like the following.