Sunday, January 4, 2015

Setting up Hadoop in Pseudo-Distributed Mode and Integrating it with WSO2 BAM

One of the most fundamental tasks of WSO2 BAM is data analysis. WSO2 BAM implements data analysis using an Apache Hadoop-based big data analytics framework, which uses the highly scalable MapReduce technology underneath.

By default, BAM ships with an embedded Hadoop instance (running in local mode) which starts and stops with BAM. But when running in a production environment, it is recommended to configure BAM with an external multi-node Hadoop cluster, which is highly available and scalable. Read here for more information on setting up BAM with a multi-node Hadoop cluster.

Other than the embedded (local) mode and the fully distributed (multi-node) mode, Hadoop can also run in pseudo-distributed mode, where each Hadoop daemon (NameNode, Secondary NameNode, DataNode, JobTracker, TaskTracker) runs in a separate Java process on a single machine.

In this post, I am going to discuss how to set up Hadoop in pseudo-distributed mode and integrate it with WSO2 BAM. This will be useful if you want to simulate a BAM + Hadoop cluster on a small scale, or if you need to connect BAM to an external Hadoop cluster but have only one server available.

Installing & Configuring Hadoop in Pseudo-Distributed Mode


1. Install Java in a location that all user groups can access

The Java location used in this example is /usr/local/java/jdk1.6.0_38.

2. Create a dedicated system user

- Create a group named 'hadoop'.
    $ sudo addgroup hadoop

- Add a new user 'hduser' to the group 'hadoop'.
    $ sudo adduser --ingroup hadoop hduser

3. Configuring SSH to localhost

- Install OpenSSH if not installed already
    $ sudo apt-get install openssh-server

- Login as hduser
    $ su - hduser

- Generate a passphrase-less SSH key pair for Hadoop.
    $ ssh-keygen -t rsa -P ''

- Append this public key to the authorized_keys file.
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

- Verify the passwordless SSH configuration using the following command.
    $ ssh localhost

4. Installing Hadoop

- Exit from hduser if you are logged in as hduser.
    $ exit

- Download the Hadoop 1.2.1 distribution from here.

- Extract Hadoop distribution
    $ tar -zxvf hadoop-1.2.1.tar.gz

- Move it to a location that all users can access
    $ sudo mv hadoop-1.2.1 /usr/local/

- Give ownership to user 'hduser' 
    $ cd /usr/local
    $ sudo chown -R hduser:hadoop hadoop-1.2.1

5. Setting JAVA_HOME/PATH variables for user 'hduser'

- Login as hduser
    $ su - hduser

- Add the following lines to the end of the $HOME/.bashrc file of user 'hduser'
    export JAVA_HOME=/usr/local/java/jdk1.6.0_38
    export PATH=$PATH:$JAVA_HOME/bin

6. Configuring Hadoop

- Define JAVA_HOME in <HADOOP_HOME>/conf/hadoop-env.sh file:
    export JAVA_HOME=/usr/local/java/jdk1.6.0_38

- Edit the <HADOOP_HOME>/conf/core-site.xml file as follows:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>fs.hdfs.impl</name>
        <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
        <description>The FileSystem for hdfs: uris.</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/hdfstmp</value>
    </property>
</configuration>

- Create the hdfstmp directory in the home directory of user 'hduser'
    $ mkdir hdfstmp

- Edit the <HADOOP_HOME>/conf/hdfs-site.xml as follows:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/usr/local/hadoop-1.2.1/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/usr/local/hadoop-1.2.1/dfs/data</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
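Note: The dfs.name.dir and dfs.data.dir directories above do not exist yet. Formatting the NameNode (step 7) and starting the DataNode will normally create them, but you can also pre-create them as hduser to rule out permission issues. A minimal sketch, assuming the paths configured above:

    $ mkdir -p /usr/local/hadoop-1.2.1/dfs/name
    $ mkdir -p /usr/local/hadoop-1.2.1/dfs/data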

- Edit <HADOOP_HOME>/conf/mapred-site.xml as follows:

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/usr/local/hadoop-1.2.1/mapred/system</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/usr/local/hadoop-1.2.1/mapred/local</value>
    </property>
</configuration>

7. Formatting the HDFS NameNode

- Login as hduser if you haven't already logged in.
    $ su - hduser

- From the Hadoop installation directory, execute the following command to format the NameNode:
    $ bin/hadoop namenode -format

Note - This formats the HDFS NameNode file system for the first run. You have to do this ONLY ONCE.

8. Starting and Stopping Hadoop

- Start the Hadoop cluster from the <HADOOP_HOME>/bin directory using the command: 

    $ sh start-all.sh

Note: The above command starts all the Hadoop daemons. Check the Hadoop daemon logs in the <HADOOP_HOME>/logs directory to verify that everything started properly.
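You can also quickly confirm that all five daemons are up using the jps tool that ships with the JDK. A minimal check; the process IDs (and possibly the ordering) will differ on your machine:

    $ jps
    4825 NameNode
    5047 DataNode
    5263 SecondaryNameNode
    5363 JobTracker
    5583 TaskTracker
    5607 Jps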

- To stop the Hadoop cluster, execute the following command from the <HADOOP_HOME>/bin directory:

    $ sh stop-all.sh

9. Hadoop Web Console


    http://localhost:50070/ – Web UI of the NameNode daemon
    http://localhost:50030/ – Web UI of the JobTracker daemon
    http://localhost:50060/ – Web UI of the TaskTracker daemon
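If you are working on a headless server without a browser, a quick way to confirm these web UIs are up is to request them with curl and check the returned status line. A minimal sketch; each command should print an HTTP status line such as HTTP/1.1 200 OK when the corresponding daemon is running:

    $ curl -s -I http://localhost:50070/ | head -n 1
    $ curl -s -I http://localhost:50030/ | head -n 1
    $ curl -s -I http://localhost:50060/ | head -n 1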


Configuring WSO2 BAM with External Hadoop


1. Download the WSO2 BAM distribution from here and unzip it. 

2. Modify the WSO2BAM_DATASOURCE in the <BAM_HOME>/repository/conf/datasources/bam-datasources.xml file. WSO2BAM_DATASOURCE is the default data source available in BAM, and it should be configured to connect to the summary RDBMS database you are using. Be sure to change the database URL and credentials according to your environment. In this example, I am using a MySQL database named BAM_STATS_DB to store BAM summary data.

<datasource>
    <name>WSO2BAM_DATASOURCE</name>
    <description>The datasource used for analyzer data</description>
    <definition type="RDBMS">
        <configuration>
            <url>jdbc:mysql://127.0.0.1:3306/BAM_STATS_DB?autoReconnect=true&amp;relaxAutoCommit=true</url>
            <username>root</username>
            <password>root</password>
            <driverClassName>com.mysql.jdbc.Driver</driverClassName>
            <maxActive>50</maxActive>
            <maxWait>60000</maxWait>
            <testOnBorrow>true</testOnBorrow>
            <validationQuery>SELECT 1</validationQuery>
            <validationInterval>30000</validationInterval>
        </configuration>
    </definition>
</datasource>

3. Add the MySQL connector JAR (mysql-connector-java-5.1.28-bin.jar) to the <BAM_HOME>/repository/components/lib directory.
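The BAM_STATS_DB database referenced in the datasource above must exist before BAM can write summary data into it. A minimal sketch for creating it with the mysql client, assuming a local MySQL server and the root credentials used in the datasource (adjust the user and credentials to match your environment):

    $ mysql -u root -p -e "CREATE DATABASE BAM_STATS_DB;"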

4. Modify <BAM_HOME>/repository/conf/advanced/hive-site.xml as follows. Note that the MySQL connector JAR is added to the hive.aux.jars.path property so that it is available in the Hadoop job execution runtime.

<property>
  <name>hadoop.embedded.local.mode</name>
  <value>false</value>
</property>
 
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
 
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
 
<property>    
  <name>hive.aux.jars.path</name>
  <value>file://${CARBON_HOME}/repository/components/plugins/apache-cassandra_1.2.13.wso2v4.jar,file://${CARBON_HOME}/repository/components/plugins/guava_12.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/json_2.0.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-dbcp_1.4.0.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/commons-pool_1.5.6.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/libthrift_0.7.0.wso2v2.jar,file://${CARBON_HOME}/repository/components/plugins/hector-core_1.1.4.wso2v1.jar,file://${CARBON_HOME}/repository/components/plugins/org.wso2.carbon.bam.cassandra.data.archive_4.2.2.jar,file://${CARBON_HOME}/repository/components/lib/mysql-connector-java-5.1.28-bin.jar</value>
</property>
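After saving hive-site.xml, start (or restart) the BAM server so that it picks up the new Hadoop and datasource configuration. A minimal sketch, assuming a Linux environment and the standard Carbon startup script:

    $ cd <BAM_HOME>/bin
    $ sh wso2server.sh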

5. Everything is configured now. Try out one of the samples to verify the BAM functionality. Check the Hadoop logs (<HADOOP_HOME>/logs) if there are errors related to the analytics part. The Hadoop web console is also useful for this.
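Apart from the BAM-side logs, you can also verify from the Hadoop side that the Hive analytics are being submitted to your pseudo-distributed cluster. A minimal check, assuming the Hadoop installation described above; it lists submitted, running and completed MapReduce jobs, and the JobTracker web UI at http://localhost:50030/ shows the same information:

    $ cd /usr/local/hadoop-1.2.1
    $ bin/hadoop job -list all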

Friday, January 2, 2015

How to package a Dashboard created using WSO2 BAM Gadget Generation Tool in a BAM Toolbox

WSO2 BAM comes with a gadget generation tool, which has a step-by-step wizard that allows you to generate gadgets for your dashboard in a few simple steps. Read here for more information.

In this post, I am going to discuss how to package a dashboard created using the Gadget Gen Tool in a BAM Toolbox. This is important because otherwise it is very difficult to add the created dashboard to a different BAM server instance. Once you have a Toolbox created, it is just a matter of installing it.

Let's assume you have created a dashboard that consists of two gadgets named brand_vs_qty and sales_summary. In my previous post, I discussed how to create such a dashboard.



Follow the steps below to package the above dashboard in a Toolbox.


1. First, create empty folders and files in the following structure. 

BAM_Toolbox <folder>
└── dashboard <folder>
    ├── dashboard.properties <file>
    ├── gadgets <folder>
    └── jaggery <folder>
        └── gadgetgen <folder>

Note: Refer to the structure of the sample HTTPD_Logs.tbox and KPI_Phone_Retail_Store.tbox toolboxes that are packed with BAM by default. (They can be found at <BAM_HOME>/samples/toolboxes.)
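If you prefer working from a terminal, the following commands (a minimal sketch) produce the same folder structure:

    $ mkdir -p BAM_Toolbox/dashboard/gadgets
    $ mkdir -p BAM_Toolbox/dashboard/jaggery/gadgetgen
    $ touch BAM_Toolbox/dashboard/dashboard.properties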

2. Copy the two Jaggery files (brand_vs_qty.jag, sales_summary.jag) that were generated earlier from <BAM_HOME>/repository/deployment/server/jaggeryapps/gadgetgen to BAM_Toolbox/dashboard/jaggery/gadgetgen.

3. Modify the above Jaggery files (brand_vs_qty.jag, sales_summary.jag) to use the datasource name of the BAM summary DB. By default, the Gadget Gen Tool adds the DB credentials provided in the wizard, so it is important to change them to use the datasource name; otherwise we would have to change the DB credentials manually. 

        try {
            db = new Database("WSO2BAM_DATASOURCE");
            result = db.query("select * from brandSummary");
        } finally {
            if (db != null) {
                db.close();
            }
        }

4. Download the gadget XMLs brand_vs_qty.xml and sales_summary.xml from the WSO2 BAM configuration registry.
Path - /_system/config/repository/components/org.wso2.carbon.bam.gadgetgen/gadgetgen 



To download the .xml files, click on each of them and then click the 'Download' link.



5. Copy the downloaded files (brand_vs_qty.xml and sales_summary.xml) to the BAM_Toolbox/dashboard/gadgets/ folder.

6. Open the brand_vs_qty.xml file using a text editor and modify the 'url' value in the update() function as follows. By default, the Gadget Gen Tool adds the localhost IP and port to the URL, so it needs to be changed to a relative path in order to support both super tenant and tenant modes.

            function update() {
                var respJson = null;
                var finalURL = "";
                if (window.location.pathname.search("/t/") == 0) {
                    finalURL = "./jaggeryapps/gadgetgen/brand_vs_qty.jag";
                } else {
                    finalURL = "../../gadgetgen/brand_vs_qty.jag";
                }
                $.ajax({
                    url: finalURL,
                    
                    dataType: 'json',

7. Open the sales_summary.xml file using a text editor and modify the 'url' value in its update() function in the same way.

            function update() {
                var respJson = null;
                var finalURL = "";
                if (window.location.pathname.search("/t/") == 0) {
                    finalURL = "./jaggeryapps/gadgetgen/sales_summary.jag";
                } else {
                    finalURL = "../../gadgetgen/sales_summary.jag";
                }
                $.ajax({
                    url: finalURL,

                    dataType: 'json',

8. We need to add the required gadget-related files (css, js, images, etc.) to our Toolbox. Extract the existing KPI_Phone_Retail_Store.tbox (or HTTPD_Logs.tbox) and copy the css, images and js folders found at KPI_Phone_Retail_Store/dashboard/gadgets to BAM_Toolbox/dashboard/gadgets/.

9. Open the dashboard.properties file and add the following properties.

    dashboard.tabs=tab1
    dashboard.tabs.tab1.name=Phone Brands
    dashboard.tabs.tab1.gadgets=brand_vs_qty.xml,sales_summary.xml

10. Zip the BAM_Toolbox folder and rename the extension to .tbox, so the Toolbox file will be BAM_Toolbox.tbox, as shown below. 
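On Linux, this can be done from the parent directory of BAM_Toolbox using the standard zip tool (a minimal sketch; it assumes the archive should contain the BAM_Toolbox folder at its top level, matching the layout of the sample toolboxes):

    $ zip -r BAM_Toolbox.zip BAM_Toolbox
    $ mv BAM_Toolbox.zip BAM_Toolbox.tbox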

11. Now we have successfully packaged the created dashboard in a Toolbox. To make sure it works properly, install the Toolbox in a fresh BAM pack and check the dashboard.

Thursday, January 1, 2015

How to use WSO2 BAM Gadget Generation Tool to Create a Dashboard

Here in this post, I am going to discuss how to create a simple dashboard using the WSO2 BAM gadget generation tool and package that dashboard in a BAM Toolbox.

Publishing Data

First, we should publish some data to BAM. Here, I am going to use the KPI sample to publish data. Open a command prompt (or a shell in Linux) and go to the <BAM_HOME>/samples/kpi-definition directory. Then run the ant command to publish some sample data, as shown below.
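In a terminal, this boils down to the following (a minimal sketch, assuming Apache Ant is installed and on the PATH, and that the BAM server is already running):

    $ cd <BAM_HOME>/samples/kpi-definition
    $ ant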

Analyzing Data

Now run the Hive script below to summarize the data that has been published. Instructions on adding and executing a Hive script can be found here.

CREATE EXTERNAL TABLE IF NOT EXISTS PhoneSalesTable
    (orderID STRING, brandName STRING, userName STRING, quantity INT, version STRING)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    WITH SERDEPROPERTIES (
        "wso2.carbon.datasource.name" = "WSO2BAM_CASSANDRA_DATASOURCE",
        "cassandra.cf.name" = "org_wso2_bam_phone_retail_store_kpi",
        "cassandra.columns.mapping" = ":key,payload_brand, payload_user, payload_quantity, Version" );

CREATE EXTERNAL TABLE IF NOT EXISTS PhonebrandTable
    (brand STRING, totalOrders INT, totalQuantity INT)
    STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
    TBLPROPERTIES (
        'wso2.carbon.datasource.name' = 'WSO2BAM_DATASOURCE',
        'hive.jdbc.update.on.duplicate' = 'true',
        'hive.jdbc.primary.key.fields' = 'brand',
        'hive.jdbc.table.create.query' =
            'CREATE TABLE brandSummary (brand VARCHAR(100) NOT NULL PRIMARY KEY, totalOrders INT, totalQuantity INT)' );

insert overwrite table PhonebrandTable
    select brandName, count(DISTINCT orderID), sum(quantity)
    from PhoneSalesTable
    where version = "1.0.0"
    group by brandName;

Note: Read here for more information about the BAM analytics framework.

Creating Gadgets

As you can see in the above Hive script, the summarized data is stored in a table named 'brandSummary' in the summary database (RDBMS). Since H2 is the default summary database packed with WSO2 BAM, the 'brandSummary' table will be created in the H2 database.
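If you want to double-check that the summary table was actually created, you can query the sample H2 database directly with the H2 Shell tool. This is only an optional sanity check; the sketch below assumes it is run from <BAM_HOME>, uses the default wso2carbon credentials and the JDBC URL from the Gadget Gen step below, and the exact name of the H2 JAR under repository/components/plugins may differ between BAM versions:

    $ cd <BAM_HOME>
    $ java -cp repository/components/plugins/h2*.jar org.h2.tools.Shell \
          -url "jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE" \
          -user wso2carbon -password wso2carbon \
          -sql "SELECT * FROM brandSummary"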

Let's generate a couple of BAM dashboard gadgets according to the instructions given here.

1. Go to Tools → Gadget Gen Tool in the BAM Management Console. Then enter the values given below, which correspond to the data source 'WSO2BAM_DATASOURCE'. 

    JDBC URL : jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
    Driver Class Name : org.h2.Driver
    User Name : wso2carbon
    Password : wso2carbon


2. In the next step, add the following SQL script.

    select * from brandSummary


3. Enter as follows in the next step.

    Pick UI Element : Bar Graph
    Chart Title : Phone Brand Vs Quantity
    Y-Axis Label : Quantity
    Y-Axis Column : TOTALQUANTITY
    X-Axis Label : Phone Brand
    X-Axis Column : BRAND


4. In the next step, enter as follows.

    Gadget Title : Phone Brand Vs Quantity
    Gadget File Name : brand_vs_qty
    Refresh Rate (in Seconds) : 60


Then click the Generate button and follow the steps mentioned in the 'Add gadget to dashboard' section of the doc here to add the gadget to the dashboard.

5. Create another gadget by selecting the 'Table' UI element in Step 3 above, and use the same details given in Step 1 and Step 2.

Generate the gadget and add it to the same tab of the dashboard. Your final dashboard will then show both the bar chart and the table gadget.

Saturday, January 18, 2014

Indexing in Cassandra to Support Events/Activity Search in WSO2 BAM

The latest WSO2 BAM release (WSO2 BAM 2.4.0) supports events and activity search functionality. Here, I'm going to discuss the Cassandra indexing techniques we considered in order to facilitate our new WSO2 BAM big data search feature.

Indexing is essential to support the events and activity search functionality. Since we store data in Cassandra column families, the columns on which search operations are to be performed have to be indexed. First, we thought of using projects like Solandra/Lucandra to implement our big data search feature. But since those projects are no longer active, and big data search is somewhat beyond the scope of WSO2 BAM, we agreed to provide limited search capabilities using the built-in Cassandra secondary indexes. However, after we finalized the basic requirements, we felt that secondary indexes alone may not be enough to support them.

The following were identified as the basic requirements for our search feature.
  1. Equality comparisons on indexed properties.
     ex: bam_mediation_stats_data_publisher.host = '192.168.2.8'
  2. Inequality comparisons (greater-than/less-than operations).
     ex: bam_mediation_stats_data_publisher.max_processing_time > 1000
  3. AND operation support.
     ex: bam_mediation_stats_data_publisher.max_processing_time > 1000 AND bam_mediation_stats_data_publisher.host = '192.168.2.8'
  4. OR operation support.
     ex: bam_mediation_stats_data_publisher.host = '192.168.2.8' OR bam_mediation_stats_data_publisher.host = '192.168.2.7'
  5. Getting search results within a specified time interval (understood as the most common case in search queries).
     ex: bam_mediation_stats_data_publisher.timestamp > start_time AND timestamp < end_time
  6. Pagination support.

Even though we initially thought of using secondary indexes, they have a lot of limitations in supporting the above requirements. Therefore, I looked into some other Cassandra custom indexing techniques as well. The details of the indexing models I came up with are given below.

1. Native Secondary Indexes

Secondary indexes in Cassandra refer to indexes on column values. Cassandra implements secondary indexes as a hidden table, separate from the table that contains the values being indexed.

Pros
  • The row and index updates are one, atomic operation.
  • When we create a secondary index on an existing column, it indexes existing data in the background (without blocking reads or writes).
  • Querying can be done easily using the IndexedSlicesQuery provided by Hector.
  • Pagination can also be done somewhat easily (setRowCount, get a batch, then setStartKey to the last key of the previous batch, and get the next batch). But if OR operations have to be performed, things get complex, because only AND is supported in queries.
  • A query can include operations on non-indexed properties as well (it requires at least one equality comparison on an indexed property; the other operations are performed in memory).

Cons
  • Not recommended for high-cardinality values (e.g. timestamps, messageIDs, keywords, etc.).
  • Requires at least one equality comparison in a query.
  • Unsorted - results are in token order, not query value order.
  • Not great to query a huge volume of records for a small number of results.
  • Not great for less-than/greater-than/range queries.
  • Range operations are performed in memory by the coordinator node.

2. Keep a separate index column family per indexed column of the primary column family - a skinny row model.

In this model, a separate index column family is created per indexed column. The values of the indexed column (of the primary CF) become the row keys of the new index CF. The column keys of each index row are composite keys created using the timestamp and the original row key (the comparator type should be DynamicCompositeType), and the column value is the original row key. There is also a special row in the index CF which keeps all the row keys of the index CF as column keys (we need this in order to support GT/LT operations on the data). The diagram below depicts how the 'host' column is indexed using this model.


  1. Read an event from the primary column family (basically a row of the primary column family).
  2. If the event contains a 'host' (indexed column) value, that value becomes the row key in the index CF, and the column key for that row is the composite of the timestamp and the original row key.
  3. Insert the new row into the index CF. (If the row key already exists, add the new column key to the existing row.)
  4. Add the newly added row key value as a column key to the special index row.

Let's see how a search can be performed with this model.

Scenario 1
Search query --> 'bam_mediation_stats_data_publisher.host = '192.168.2.7' AND bam_mediation_stats_data_publisher.timestamp > start_time AND timestamp < end_time'

  1. To support the above search, we need an index CF created for the 'host' column as explained above.
  2. Since this is an equality comparison, we can do a range query on the row (in the index CF) whose key is '192.168.2.7' and get the original row keys within the given time range. (The columns are sorted by timestamp, so we can perform a range query and get the row keys between the given time interval.)
  3. Perform another lookup in the primary column family and get the actual rows using the resulting row keys fetched above.

Scenario 2
Search query --> 'bam_mediation_stats_data_publisher.count > 5 AND count < 10 AND bam_mediation_stats_data_publisher.timestamp > start_time AND < end_time'

  1. To support the above query, we need an index CF created for the 'count' column as explained above.
  2. Since this is an inequality comparison, we first need to perform a range query on our special index row (in the index CF) and get the index row keys between 5 and 10 (the column keys are sorted).
  3. Then we need to perform range queries on the individual resulting index row keys (similar to step 2 in scenario 1).
  4. Perform another lookup in the primary column family and get the actual rows using the resulting row keys fetched above.

Scenario 3
Search query --> 'bam_mediation_stats_data_publisher.host = '192.168.2.7' AND 'bam_mediation_stats_data_publisher.count > 5 AND count < 10 AND bam_mediation_stats_data_publisher.timestamp > start_time AND < end_time'

  1. Here we have to follow the steps described in both of the above scenarios and take the intersection of the results.

Pros
  • Can overcome the limitations of Cassandra's native secondary indexes. (Basically, using this model we can overcome all the cons mentioned for native secondary indexes.)

Cons
  • Index CF inserts have to be done manually.
  • It is difficult to index existing data (in case we create the index on CFs that already contain data).
  • Index data is spread over many Cassandra nodes; the cluster needs to fetch it all from different nodes and merge it, which has a cost.
  • Pagination can be done only to some extent. It can be supported for the queries mentioned in scenarios 1 and 2 above (by providing the last result row key of the previous page and the page size), but it has to be manually implemented for queries similar to the one mentioned in scenario 3 (there we have to fetch all the results, or a limited result set, into memory or an RDBMS and support pagination manually).

3. Keep a separate index column family per indexed column of the primary column family - a wide row model

In this model, one row keeps all the index data as column keys (a single row can have up to 2 billion columns). A composite key (DynamicComposite) is created using the value of the indexed column, the timestamp and the row key of the primary CF. These composite keys are inserted as column keys of the index row of the index CF (composite columns are ordered first by their first component, then by their second component, and so on), and the column value is the original row key. There should also be a special row in the index CF which keeps all the unique index values (we need this in order to support GT/LT operations on the data). The diagram below depicts how the 'host' column is indexed using this model.


  1. Read an event from the primary column family (basically a row of the primary column family).
  2. If the event contains a host (indexed column) value, create a composite key using the host value, the timestamp and the row key (DynamicComposite type).
  3. Add that column key to the index row ('INDEX_ROW') in the index CF. Also add the host value as a column key to our special INDEX_VALUE_ROW.

Let's see how the search can be performed with this model.

Scenario 1
Search query --> 'bam_mediation_stats_data_publisher.host = '192.168.2.7' AND bam_mediation_stats_data_publisher.timestamp > start_time AND timestamp < end_time'

  1. Results are retrieved by performing a range query on the index row (INDEX_ROW). Here we have composite keys as column keys. Since this is an equality comparison, the first component of the composite column is fixed, and we can then filter by the second component (the timestamp). For example, range start = 192.168.2.7:start_time and range end = 192.168.2.7:end_time.
  2. Perform another lookup in the primary column family and get the actual rows using the resulting row keys fetched above.

Scenario 2
Search query --> 'bam_mediation_stats_data_publisher.count > 5 AND count < 10 AND bam_mediation_stats_data_publisher.timestamp > start_time AND < end_time'

  1. Since this is an inequality comparison, we first need to perform a range query on the special INDEX_VALUE_ROW (in the index CF) and get the index values between 5 and 10.
  2. Then we need to perform range queries on the individual resulting index values (similar to step 1 in scenario 1). For example, let's assume that the index values retrieved in step 1 above are {6, 8}. Then two range queries need to be performed on INDEX_ROW, as below.
     i.  range start = 6:start_time and range end = 6:end_time
     ii. range start = 8:start_time and range end = 8:end_time
  3. Perform another lookup in the primary column family and get the actual rows using the resulting row keys fetched above.

Scenario 3
Search query --> 'bam_mediation_stats_data_publisher.host = '192.168.2.7' AND 'bam_mediation_stats_data_publisher.count > 5 AND count < 10 AND bam_mediation_stats_data_publisher.timestamp > start_time AND < end_time'

  1. Here we have to follow the steps described in both of the above scenarios and take the intersection of the results.

Pros
  • Can overcome the limitations of Cassandra's native secondary indexes.
  • All columns of a row are stored on the same node, in the same data block, and sorted on disk, so accessing and scanning these columns is extremely fast.

Cons
  • Index CF inserts have to be done manually.
  • It is difficult to index existing data (in case we create the index on CFs that already contain data).
  • Pagination can be done only to some extent. It can be supported for the queries mentioned in scenarios 1 and 2 above (by providing the last result row key of the previous page and the page size), but it has to be manually implemented for queries similar to the one mentioned in scenario 3 (there we have to fetch all the results, or a limited result set, into memory or an RDBMS and support pagination manually).

You can see that we can overcome most of the limitations that native secondary indexes have by using the two custom indexing models. Still, index insertion has to be done manually, and it is very difficult to create an index for a CF that already contains data. Pagination support is also a concern.

Considering all the pros and cons, we finally chose indexing model 3 described above (the wide row model) to facilitate our Cassandra data search feature, as it arguably seems the best of the models described.


References
http://www.datastax.com/docs/1.1/ddl/indexes
http://chamibuddhika.wordpress.com/2011/11/27/cassandra-lessons-learnt/

Connecting to Cassandra Keyspaces Created by WSO2 BAM using CQL3 Commands

Here, I'm going to explain how to connect to Cassandra keyspaces created by WSO2 BAM using CQL3 commands.

Most people are able to connect to Cassandra keyspaces using CQL2 commands, but fail when using CQL3 commands. That is because in CQL3, names are case-insensitive by default, while they were case-sensitive in CQL2. So you have to force whatever case you want in CQL3 by using double quotes.


You will be able to connect to Cassandra keyspaces by using the following sequence of commands.


..:/apache-cassandra-1.2.4/bin$ ./cqlsh -3 localhost 9160 -u admin -p admin
Connected to Test Cluster at localhost:9160.
[cqlsh 2.3.0 | Cassandra 1.1.3 | CQL spec 3.0.0 | Thrift protocol 19.32.0]
Use HELP for help.
cqlsh> use EVENT_KS;
cqlsh:EVENT_KS> select * from bam_mediation_stats_data_publisher;
Bad Request: Keyspace event_ks does not exist
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
cqlsh:EVENT_KS> USE "EVENT_KS";
cqlsh:EVENT_KS> select * from bam_mediation_stats_data_publisher;

key                                    | Description                      | Name                               | Nick_Name               | StreamId                                 | Timestamp     | Version | meta_host | payload_avg_processing_time | payload_count | payload_direction | payload_fault_count | payload_max_processing_time | payload_min_processing_time | payload_resource_id                 | payload_stats_type | payload_timestamp
----------------------------------------+----------------------------------+------------------------------------+-------------------------+------------------------------------------+---------------+---------+-----------+-----------------------------+---------------+-------------------+---------------------+-----------------------------+-----------------------------+-------------------------------------+--------------------+-------------------
1384321342409::10.100.0.40::9443::1232 | A sample for Mediator Statistics | bam_mediation_stats_data_publisher | MediationStatsDataAgent | bam_mediation_stats_data_publisher:1.0.0 | 1384321342409 |   1.0.0 | 127.0.0.3 |                       915.5 |             1 |                In |                   0 |                        1270 |                         561 |    Simple_Stock_Quote_Service_Proxy |              Proxy |     1386174113944