Apache Samza is an open source, distributed stream processing framework. It builds on two major packages: Apache Kafka and Apache Hadoop.

Apache Kafka is used for messaging.

Apache Hadoop YARN provides fault tolerance, processor isolation, security, and resource management.

This post describes how to install Samza and run your first Samza job on a 32-bit Ubuntu 14.04 system.

Prerequisites:

The following packages are required to install and configure Apache Samza:

JDK 1.7
Maven (this post installs Maven 3.0.1)

kafka
yarn
zookeeper

#  apt-get install curl gem 


Download the JDK and set the path:

We need to install the JDK and set its environment variables.


# mkdir -p /usr/java && cd /usr/java

# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-i586.tar.gz"

# tar xzf jdk-7u79-linux-i586.tar.gz


Set JAVA_HOME and add the JDK to the PATH:

# JAVA_HOME=/usr/java/jdk1.7.0_79
# export JAVA_HOME
# PATH=$JAVA_HOME/bin:$PATH
# export PATH

Then add those lines to ~/.bashrc and to the system-wide bashrc (on Ubuntu this is /etc/bash.bashrc) so that they persist across logins.
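
To make this step repeatable, a small helper can append each line only if it is not already in the file. This is a minimal sketch: the function names append_once and persist_java_env are made up for illustration, and the JDK path is the one used in this post.

```shell
# append_once FILE LINE
# Appends LINE to FILE unless an identical line is already there.
# (Illustrative helper, not part of Samza or Ubuntu.)
append_once() {
    file=$1
    line=$2
    touch "$file"
    # -x matches the whole line, -F treats LINE as a fixed string
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# persist_java_env FILE
# Writes the two export lines from this post into FILE.
# Single quotes keep $JAVA_HOME and $PATH unexpanded in the file.
persist_java_env() {
    append_once "$1" 'export JAVA_HOME=/usr/java/jdk1.7.0_79'
    append_once "$1" 'export PATH=$JAVA_HOME/bin:$PATH'
}
```

Calling persist_java_env ~/.bashrc twice leaves only one copy of each line, so it is safe to run again after a reinstall.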

Install Maven:

Next, download the Maven package and install it. This package provides the mvn3 command:

#  wget https://launchpad.net/~bneijt/+archive/ubuntu/ppa/+build/2139203/+files/maven3_3.0.1-0~ppa2_all.deb


# dpkg -i maven3_3.0.1-0~ppa2_all.deb 


Check your Maven version:

#  mvn3 -version
Apache Maven 3.0.1 (r1038046; 2010-11-23 16:28:32+0530)
Java version: 1.7.0_79
Java home: /usr/java/jdk1.7.0_79/jre
Default locale: en_IN, platform encoding: UTF-8
OS name: "linux" version: "3.8.0-29-generic" arch: "i386" Family: "unix"
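
Before moving on, it can help to confirm that the tools used in the rest of this post are on the PATH. The following is a rough sketch; the helper name missing_tools is invented here, and which tools you pass to it is up to you.

```shell
# missing_tools CMD...
# Prints each CMD that cannot be found on the PATH, one per line,
# and returns non-zero if any of them is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For this post, something like missing_tools java mvn3 git curl should print nothing if the earlier steps worked.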

Install Hello-Samza:


Let's install it under the /usr/local directory, so change to that directory:

# cd /usr/local


Clone the hello-samza repository:

# git clone git://git.apache.org/samza-hello-samza.git hello-samza


The hello-samza project includes a script called "grid". It helps you set up Kafka, YARN and ZooKeeper, and start them.

Run the commands below:

# cd /usr/local/hello-samza


 root@dev:/usr/local/hello-samza# bin/grid install kafka 
EXECUTING: install kafka
Downloading kafka_2.10-0.8.2.1.tgz...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 15 15.4M   15 2406k    0     0   304k      0  0:00:51  0:00:07  0:00:44  443k

 root@dev:/usr/local/hello-samza# bin/grid install yarn 
EXECUTING: install yarn
Downloading hadoop-2.6.1.tar.gz...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 77  187M   77  145M    0     0   239k      0  0:13:23  0:10:22  0:03:01  204k

 root@dev:/usr/local/hello-samza#  bin/grid install zookeeper
EXECUTING: install zookeeper
Downloading zookeeper-3.4.3.tar.gz...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  8 15.4M    8 1324k    0     0   212k      0  0:01:14  0:00:06  0:01:08  266k

Now you can see that all the packages are in a sub-directory called “deploy” inside hello-samza’s root folder.

root@dev:/usr/local/hello-samza# cd deploy

root@dev:/usr/local/hello-samza/deploy# ls 
kafka  yarn  zookeeper
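
If you prefer a scripted check over reading the ls output, the same test can be sketched as a small function. The name missing_components is hypothetical; the three directory names come from the listing above.

```shell
# missing_components DEPLOY_DIR
# Prints the name of each expected component directory that is absent
# under DEPLOY_DIR. Prints nothing when kafka, yarn and zookeeper all exist.
missing_components() {
    for name in kafka yarn zookeeper; do
        [ -d "$1/$name" ] || printf '%s\n' "$name"
    done
}
```

Run it as missing_components /usr/local/hello-samza/deploy; any name it prints needs another bin/grid install.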

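Next, go back to the hello-samza root folder and execute the bin/grid bootstrap command. As the output below shows, it runs a build, installs ZooKeeper, YARN and Kafka (reusing the files already downloaded), and starts them: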

root@dev:/usr/local/hello-samza# bin/grid bootstrap 
Download http://repo1.maven.org/maven2/org/fusesource/scalate/scalate-util_2.10/1.6.1/scalate-util_2.10-1.6.1.jar
:samza-yarn_2.10:processResources
:samza-yarn_2.10:classes
:samza-yarn_2.10:lesscss
....
....
BUILD SUCCESSFUL

Total time: 20 mins 32.855 secs
/usr/local/hello-samza
EXECUTING: install zookeeper
Using previously downloaded file /root/.samza/download/zookeeper-3.4.3.tar.gz
EXECUTING: install yarn
Using previously downloaded file /root/.samza/download/hadoop-2.6.1.tar.gz
EXECUTING: install kafka
Using previously downloaded file /root/.samza/download/kafka_2.10-0.8.2.1.tgz
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
EXECUTING: start yarn
starting resourcemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-resourcemanager-dev.out
starting nodemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-nodemanager-dev.out
EXECUTING: start kafka
 

Once the grid command completes, you can verify that YARN is up and running by opening the URL http://localhost:8088. This is the YARN UI.
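
The services can take a little while to come up, so a retry loop is handy when scripting this check. Below is a minimal sketch of a generic retry helper; wait_until is an illustrative name, and the curl call in the comment assumes curl is installed.

```shell
# wait_until TRIES DELAY CMD [ARG...]
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example: poll the YARN UI for up to about a minute.
#   wait_until 30 2 curl -sf http://localhost:8088 && echo "YARN UI is up"
```

Keeping the helper generic means the same loop can wait for any other command that signals readiness through its exit status.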

Build a Samza Job Package:

Before you can run a job, you need to build a job package. This package is what YARN uses to deploy your jobs on the grid.

NOTE: If you are building from the latest branch of the hello-samza project, first run the following step from your local Samza project (the Samza source checkout, not hello-samza):

# ./gradlew publishToMavenLocal


Then continue with the following commands in the hello-samza project (if only mvn3 is on your PATH, use mvn3 in place of mvn):

root@dev:/usr/local/hello-samza# mvn clean package

root@dev:/usr/local/hello-samza# mkdir -p deploy/samza

root@dev:/usr/local/hello-samza# tar -xvf ./target/hello-samza-0.10.0-dist.tar.gz -C deploy/samza
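
The tarball name contains the version (0.10.0 here), so it will differ if you build another branch. One way to avoid hardcoding it is to locate the file by pattern. This is only a sketch; find_dist is a made-up helper name.

```shell
# find_dist TARGET_DIR
# Prints the path of the single hello-samza-*-dist.tar.gz in TARGET_DIR.
# Fails if there is no match or more than one.
find_dist() {
    # Replace the function's own arguments with the glob matches.
    set -- "$1"/hello-samza-*-dist.tar.gz
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "expected exactly one dist tarball" >&2
        return 1
    fi
    printf '%s\n' "$1"
}
```

It could then be used as tar -xvf "$(find_dist ./target)" -C deploy/samza.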


Run a Samza Job:

Once you have built your Samza package, you can run a job on the grid using the run-job.sh script.

root@dev:/usr/local/hello-samza # deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties 
 

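The job above consumes a feed of real-time edits from Wikipedia and produces them to a Kafka topic called “thelinuxfaq-raw”. (In the stock hello-samza project this topic is named “wikipedia-raw”; if the consumer below prints nothing, try that name.)

Give the job a minute to start up, and then tail the Kafka topic: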

root@dev:/usr/local/hello-samza#  deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic thelinuxfaq-raw


Now check the YARN UI again (http://localhost:8088). If there were no errors, you’ll see that your Samza job is running!

Shut down Samza:

When you are done, you can shut everything down using the same grid script.
root@dev:/usr/local/hello-samza #  bin/grid stop all 

Sample Output:

EXECUTING: stop all
EXECUTING: stop kafka
EXECUTING: stop yarn
stopping resourcemanager
stopping nodemanager
EXECUTING: stop zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Stopping zookeeper ... STOPPED

Start Samza:

You can also start everything up again using the same grid script:

root@dev:/usr/local/hello-samza #  bin/grid start all 

Sample Output:
EXECUTING: start all
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
EXECUTING: start yarn
....
EXECUTING: start kafka
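
The stop and start commands can be combined into a single restart helper. The sketch below assumes the hello-samza checkout location is passed as an argument; grid_restart is an illustrative name, not something the grid script provides.

```shell
# grid_restart HELLO_SAMZA_DIR
# Stops and then starts all services with the grid script found in
# HELLO_SAMZA_DIR/bin. The body runs in a subshell, so the caller's
# working directory is left unchanged.
grid_restart() (
    cd "$1" || exit 1
    bin/grid stop all && bin/grid start all
)
```

For the layout used in this post, that would be grid_restart /usr/local/hello-samza. The start step only runs if the stop step succeeded.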