Apache Samza is an open source, distributed stream processing framework. It relies on two major packages, Apache Kafka and Apache Hadoop.
Apache Kafka is used for messaging.
Apache Hadoop YARN provides fault tolerance, processor isolation, security, and resource management.
This post describes how to install Samza and run your first Samza job on a 32-bit Ubuntu 14.04 system.
Prerequisites:
The following packages are required to install and configure Apache Samza:
JDK 1.7
maven (version 3 is installed below)
kafka
yarn
zookeeper
# apt-get install curl gem
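Before going further, it can help to see which command-line tools are already present. The following is a minimal sketch, not part of Samza: the function name missing_tools is hypothetical, and the list of tools is an assumption based on the commands used later in this post.

```shell
# Print the name of every listed command that is not found on PATH.
# Usage: missing_tools <command>...
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list: the commands this post calls directly.
missing_tools java git curl tar wget
```

An empty result means every listed tool was found; any name that is printed still needs to be installed.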
Download and set JDK Path:
Install the JDK and set its environment variables.
# mkdir -p /usr/java && cd /usr/java
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-i586.tar.gz"
Extract the archive and set the JAVA_HOME path:
# tar -zxvf jdk-7u79-linux-i586.tar.gz
# JAVA_HOME=/usr/java/jdk1.7.0_79
# export JAVA_HOME
# PATH=$JAVA_HOME/bin:$PATH
# export PATH
Then, add these lines to ~/.bashrc and /etc/bash.bashrc so that they persist across sessions.
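Appending the exports by hand is easy to repeat by accident. As a rough sketch, the helper below adds them to a profile file only when no JAVA_HOME export is present yet. The function name persist_java_home is hypothetical, and the JDK directory is the one assumed throughout this post.

```shell
# Append the JAVA_HOME and PATH exports to an rc file, unless a JAVA_HOME
# export is already there.
# Usage: persist_java_home <rc-file> <jdk-directory>
persist_java_home() {
    local rc_file="$1" jdk_dir="$2"
    # Nothing to do when the file already exports JAVA_HOME.
    if grep -q '^export JAVA_HOME=' "$rc_file" 2>/dev/null; then
        return 0
    fi
    {
        printf 'export JAVA_HOME=%s\n' "$jdk_dir"
        # Single quotes keep $JAVA_HOME and $PATH unexpanded in the file.
        printf 'export PATH=$JAVA_HOME/bin:$PATH\n'
    } >> "$rc_file"
}

# Example call (uncomment to modify your own profile):
# persist_java_home "$HOME/.bashrc" /usr/java/jdk1.7.0_79
```

Running it a second time against the same file leaves the file unchanged.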
Install Maven:
Next, download the Maven 3 package and install it:
# wget https://launchpad.net/~bneijt/+archive/ubuntu/ppa/+build/2139203/+files/maven3_3.0.1-0~ppa2_all.deb
# dpkg -i maven3_3.0.1-0~ppa2_all.deb
Check your Maven version:
# mvn3 -version
Apache Maven 3.0.1 (r1038046; 2010-11-23 16:28:32+0530)
Java version: 1.7.0_79
Java home: /usr/java/jdk1.7.0_79/jre
Default locale: en_IN, platform encoding: UTF-8
OS name: "linux" version: "3.8.0-29-generic" arch: "i386" Family: "unix"
Install Hello-Samza:
Let's install it under the /usr/local directory, so change to that directory:
# cd /usr/local
Clone the hello-samza repository:
# git clone git://git.apache.org/samza-hello-samza.git hello-samza
The hello-samza project includes a script called "grid". It helps you set up and start Kafka, YARN, and ZooKeeper.
Run the commands below:
# cd /usr/local/hello-samza
root@dev:/usr/local/hello-samza# bin/grid install kafka
EXECUTING: install kafka
Downloading kafka_2.10-0.8.2.1.tgz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
15 15.4M 15 2406k 0 0 304k 0 0:00:51 0:00:07 0:00:44 443k
root@dev:/usr/local/hello-samza# bin/grid install yarn
EXECUTING: install yarn
Downloading hadoop-2.6.1.tar.gz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
77 187M 77 145M 0 0 239k 0 0:13:23 0:10:22 0:03:01 204k
root@dev:/usr/local/hello-samza# bin/grid install zookeeper
EXECUTING: install zookeeper
Downloading zookeeper-3.4.3.tar.gz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
8 15.4M 8 1324k 0 0 212k 0 0:01:14 0:00:06 0:01:08 266k
All the package files are now in a sub-directory called “deploy” inside hello-samza’s root folder:
root@dev:/usr/local/hello-samza# cd deploy
root@dev:/usr/local/hello-samza/deploy# ls
kafka yarn zookeeper
Next, go back to hello-samza’s root folder and execute the bin/grid bootstrap command:
root@dev:/usr/local/hello-samza# bin/grid bootstrap
Download http://repo1.maven.org/maven2/org/fusesource/scalate/scalate-util_2.10/1.6.1/scalate-util_2.10-1.6.1.jar
:samza-yarn_2.10:processResources
:samza-yarn_2.10:classes
:samza-yarn_2.10:lesscss
....
....
BUILD SUCCESSFUL
Total time: 20 mins 32.855 secs
/usr/local/hello-samza
EXECUTING: install zookeeper
Using previously downloaded file /root/.samza/download/zookeeper-3.4.3.tar.gz
EXECUTING: install yarn
Using previously downloaded file /root/.samza/download/hadoop-2.6.1.tar.gz
EXECUTING: install kafka
Using previously downloaded file /root/.samza/download/kafka_2.10-0.8.2.1.tgz
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
EXECUTING: start yarn
starting resourcemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-resourcemanager-dev.out
starting nodemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-nodemanager-dev.out
EXECUTING: start kafka
Once the grid command completes, you can verify that YARN is up and running by opening http://localhost:8088, which is the YARN UI.
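On a machine without a browser, the same check can be made from the shell. The sketch below polls a URL with curl until it answers or the attempts run out. The function name wait_for_url and the default retry values are assumptions chosen for illustration.

```shell
# Poll a URL until it responds successfully or the attempts run out.
# Usage: wait_for_url <url> [attempts] [delay-seconds]
wait_for_url() {
    local url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
    while [ "$i" -le "$attempts" ]; do
        # -s: no progress output, -f: fail on HTTP errors,
        # -o /dev/null: discard the body, --max-time: cap each request
        if curl -s -f -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Example call:
# wait_for_url http://localhost:8088 && echo "YARN UI is up"
```

The function returns a non-zero status when the URL never answers, so it can gate later steps in a script.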
Build a Samza Job Package:
Next, build a job package. This package is what YARN uses to deploy your jobs on the grid.
NOTE: If you are building from the latest branch of the hello-samza project, make sure that you first run the following step from your local Samza project:
root@dev:/usr/local/hello-samza# ./gradlew publishToMavenLocal
Then, continue with the following commands in the hello-samza project:
root@dev:/usr/local/hello-samza# mvn clean package
root@dev:/usr/local/hello-samza# mkdir -p deploy/samza
root@dev:/usr/local/hello-samza# tar -xvf ./target/hello-samza-0.10.0-dist.tar.gz -C deploy/samza
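The tarball name above contains the project version (0.10.0 in this post), so it will probably differ on another branch. As a sketch under that assumption, the hypothetical helper find_dist_tarball looks the file up by pattern instead of hard-coding the version.

```shell
# Print the path of the hello-samza distribution tarball in a build directory.
# Fails unless exactly one file matches.
# Usage: find_dist_tarball <target-directory>
find_dist_tarball() {
    local target_dir="$1" count=0 match="" f
    for f in "$target_dir"/hello-samza-*-dist.tar.gz; do
        [ -e "$f" ] || continue   # the pattern stays literal when nothing matches
        match="$f"
        count=$((count + 1))
    done
    [ "$count" -eq 1 ] && printf '%s\n' "$match"
}

# Example, run from the hello-samza root after the Maven build:
# mkdir -p deploy/samza && tar -xvf "$(find_dist_tarball ./target)" -C deploy/samza
```

Refusing to continue when several tarballs match avoids silently unpacking a stale build.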
Run a Samza Job:
Once you have built your Samza package, you can run a job on the grid using the run-job.sh script:
root@dev:/usr/local/hello-samza # deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
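The --config-path value is a file:// URI built from the current directory, which is why the command has to be run from hello-samza's root folder. The sketch below wraps that in two small functions. The names job_config_uri and run_samza_job are hypothetical; the flags are the ones from the command above.

```shell
# Build the file:// URI of a job's properties file.
# Usage: job_config_uri <hello-samza-root> <job-name>
job_config_uri() {
    local root="$1" job="$2"
    printf 'file://%s/deploy/samza/config/%s.properties\n' "$root" "$job"
}

# Launch a job by name; must be called from the hello-samza root folder.
# Usage: run_samza_job <job-name>
run_samza_job() {
    deploy/samza/bin/run-job.sh \
        --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
        --config-path="$(job_config_uri "$PWD" "$1")"
}

# Example call:
# run_samza_job wikipedia-feed
```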
This job consumes a feed of real-time edits from Wikipedia and produces them to a Kafka topic called “wikipedia-raw”.
Give the job a minute to start up, and then tail the Kafka topic:
root@dev:/usr/local/hello-samza# deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
Now check the YARN UI again (http://localhost:8088). If there were no errors, you’ll see that your Samza job is running!
Shutdown Samza:
When you are done, you can shut everything down using the same grid script:
root@dev:/usr/local/hello-samza # bin/grid stop all
Sample Output:
EXECUTING: stop all
EXECUTING: stop kafka
EXECUTING: stop yarn
stopping resourcemanager
stopping nodemanager
EXECUTING: stop zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Stopping zookeeper ... STOPPED
Start Samza:
You can also start everything up using the same grid script:
root@dev:/usr/local/hello-samza # bin/grid start all
Sample Output:
EXECUTING: start all
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
EXECUTING: start yarn
....
EXECUTING: start kafka
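The stop and start commands can be chained when a full restart is needed. The following is only a sketch: restart_grid is a hypothetical helper name, and the default path assumes the /usr/local/hello-samza location used in this post.

```shell
# Stop and then start every service through the grid script.
# Usage: restart_grid [path-to-hello-samza]
restart_grid() {
    local root="${1:-/usr/local/hello-samza}"
    # The subshell keeps the caller's working directory unchanged;
    # "start all" only runs when "stop all" succeeded.
    ( cd "$root" && bin/grid stop all && bin/grid start all )
}

# Example call:
# restart_grid /usr/local/hello-samza
```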