Monday, December 18, 2017

Bash scripting meets Hadoop Ecosystem

Steps to create pseudo-distributed Hadoop Ecosystem 


Step#1 

Download Hadoop Environment Creation.pdf and main_hadoopXx.zip from: 

Update: As of 12/19/2017, main_hadoop2x.zip is recommended, because Tez jobs are failing with main_hadoop3x.zip.

Step#2

To create your VirtualBox VM, follow the instructions (written for Windows) in Hadoop Environment Creation.pdf.

Take a look at these helpful videos:
For MAC -
For Linux/Ubuntu -


Step#3

Extract main.zip on the desktop of your CentOS 7 VM. Read the IMPORTANT section below first, then run the scripts one after another in the following sequence:

1EnableSystem.sh
2ConfigHadoopEcoSys.sh
3StartAll.sh
4Stop-All.sh
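
If you would rather not start each script by hand, the sequence can be sketched as a small wrapper. This is only an illustration: run_in_order is a hypothetical helper that is not part of the zip file, and it assumes the scripts sit in the current directory inside the dedicated VM. 4Stop-All.sh is left out of the example call on the assumption that you want the services to stay up until you are done.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the provided scripts): run the given
# scripts one after another and stop at the first one that fails.
run_in_order() {
    local script
    for script in "$@"; do
        if [ ! -f "$script" ]; then
            echo "missing: $script" >&2
            return 1
        fi
        echo "running: $script"
        # Prompts from the script (repository choice, MySQL password)
        # still reach the terminal because stdin is left untouched.
        bash "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Example call, inside the dedicated VM only:
#   run_in_order 1EnableSystem.sh 2ConfigHadoopEcoSys.sh 3StartAll.sh
```

Stopping at the first failure keeps a later script from running against a half-configured system.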


IMPORTANT

  1. Under no circumstances run these scripts on a physical machine: DO NOT RUN ANY OF THESE SCRIPTS ON ANY PHYSICAL MACHINE (this applies especially to Linux and Mac users), as they may damage your operating system and/or installed applications. Do not run them on any of your existing VMs either; the system may become unstable or unusable. The scripts are intended purely for learning purposes, so run them only inside a newly created VM set up specifically for them.
  2. In most cases the Bridged network setting works perfectly well in VirtualBox.
  3. After installing CentOS in your VM, check that it can access the internet, which is needed to download the required packages. If your guest (the VM) cannot reach the internet or the host (your physical laptop/desktop), try disabling Wi-Fi and using a wired connection instead.
  4. During installation you will be prompted to select the repository: DVD or Internet. If your host can connect to the internet (which should be true in most cases), opt for Internet (option 2). It should work for most users who have a working internet connection. The DVD option is for users who are comfortable with Linux (to get an idea, look at the functions WhichRepo() and InstallProgsFromDVD() in the script 1EnableSystem.sh).
  5. While running 1EnableSystem.sh you will be prompted to set a password for MySQL. Use "root" (without quotes) as the password for the root user, because this is what the later scripts are configured to use, i.e. MySQL username: root, password: root.
  6. If you do not have internet access, create a folder named "dwnld" in the same directory as your scripts and put all the packages in it. Also get CentOS-7-x86_64-Everything-1708.iso (8GB download) and attach it to your VM as a CD/DVD.
  7. To load your files onto your freshly installed CentOS for the first time, you can use the following ISO creation software:
           Windows:
           https://www.ezbsystems.com/ultraiso/
           https://www.ezbsystems.com/dl1.php?file=uiso9_pe.exe

           Linux:
           Ubuntu - sudo apt install brasero
           CentOS/Fedora - yum install brasero / dnf install brasero

           After creating the ISO with one of the tools above, attach it to the VM the same way you attached the ISO while setting up the CentOS operating system, put your files in a folder on your CentOS desktop, and then run your scripts as mentioned above.
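
A few of the points above (VM only, internet access, the "dwnld" fallback) can be checked before 1EnableSystem.sh is started. The sketch below is only an illustration and is not taken from the provided scripts: running_in_vm and package_source are hypothetical helper names, the default probe URL is a placeholder you should replace with any site you expect to be reachable, and the check relies on systemd-detect-virt and curl being installed.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; not part of the provided scripts.

# Succeeds only when systemd reports that we are inside a virtual machine.
running_in_vm() {
    command -v systemd-detect-virt >/dev/null 2>&1 &&
        systemd-detect-virt --vm --quiet
}

# Prints "internet" when the probe URL answers, "offline" when it does not
# but a dwnld folder exists next to the scripts, and "none" otherwise.
# $1 = directory holding the scripts, $2 = probe URL (placeholder default).
package_source() {
    local script_dir="$1"
    local probe_url="${2:-https://www.centos.org/}"
    if curl --silent --fail --head --max-time 10 --output /dev/null "$probe_url"; then
        echo internet
    elif [ -d "$script_dir/dwnld" ]; then
        echo offline
    else
        echo none
    fi
}

# Example:
#   running_in_vm || { echo "Not a VM, refusing to continue" >&2; exit 1; }
#   package_source "$HOME/Desktop/main"
```

A result of "none" means neither option from points 4 and 6 is available yet, so the installation step would have nothing to install from.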


Once you are all done, you should have a functional Hadoop (pseudo-distributed) environment with the following components of its ecosystem:

HDFS, YARN, SPARK, TEZ, PIG, SQOOP, FLUME, ACCUMULO, HBASE, HIVE, HCATALOG, STORM, ZOOKEEPER


Components of Hadoop EcoSystem - https://hortonworks.com/ecosystems/

Quick Links 

After all scripts have completed (refer to the comments inside the scripts for more details), access these URLs:

http://hostname:9870/dfshealth.html # Hadoop 3
http://hostname:50070/dfshealth.html # Hadoop 2
http://hostname:9864/datanode.html # Hadoop 3
http://hostname:50075/datanode.html # Hadoop 2
http://hostname:9864/logs/ # Hadoop 3
http://hostname:50075/logs/ # Hadoop 2
http://hostname:9868/status.html # Hadoop 3
http://hostname:50090/status.html # Hadoop 2
http://hostname:8088/cluster/nodes # YARN ResourceManager
http://hostname:8042/node/allApplications # YARN NodeManager
http://hostname:19888/jobhistory # MapReduce JobHistory
http://hostname:16010/master-status # HBase
http://hostname:8983/solr/ # Solr
http://hostname:8080 # Storm
http://hostname:9995 # Accumulo
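
Instead of opening each link in a browser, you can probe the core Hadoop pages from a terminal. The following is a minimal sketch, assuming curl is installed; hadoop_ui_urls and check_hadoop_uis are hypothetical helper names, and the ports are simply the ones from the list above (9870/9864/9868 for Hadoop 3, 50070/50075/50090 for Hadoop 2).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the provided scripts.

# Print the core web UI URLs for a host.
# $1 = hostname, $2 = Hadoop major version (2 or 3, defaults to 2).
hadoop_ui_urls() {
    local host="$1"
    local version="${2:-2}"
    if [ "$version" = "3" ]; then
        echo "http://$host:9870/dfshealth.html"
        echo "http://$host:9864/datanode.html"
        echo "http://$host:9868/status.html"
    else
        echo "http://$host:50070/dfshealth.html"
        echo "http://$host:50075/datanode.html"
        echo "http://$host:50090/status.html"
    fi
    echo "http://$host:8088/cluster/nodes"
    echo "http://$host:8042/node/allApplications"
    echo "http://$host:19888/jobhistory"
}

# Report which of those URLs answer with a successful HTTP status.
check_hadoop_uis() {
    local url
    hadoop_ui_urls "$@" | while read -r url; do
        if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
            echo "UP   $url"
        else
            echo "DOWN $url"
        fi
    done
}

# Example:
#   check_hadoop_uis "$(hostname)" 2
```

A page reported as DOWN right after 3StartAll.sh may just need more time to come up, so it is worth retrying before digging into the logs.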


  • If you face any issues or have any suggestions, leave a comment at the bottom along with your suggestion and/or the error message or description. I will try to resolve as much as I can, as soon as I can.

Stay tuned for the rest of the modules/components of the Hadoop ecosystem. Happy learning!


Regards,
~AK


Some honorable references: 

https://www.tutorialspoint.com/hadoop/
http://hadooptutorial.info
https://github.com/NitinKumar94/Installing-Apache-Tez
https://hostpresto.com/community/tutorials/how-to-install-apache-hadoop-on-a-single-node-on-centos-7/
http://fibrevillage.com/storage/617-hadoop-2-7-cluster-installation-and-configuration-on-rhel7-centos7
https://www.tecmint.com/install-configure-apache-hadoop-centos-7/
http://www.mcclellandlegge.com/2017-01-01-installhadoop/
https://unskilledcoder.github.io/hadoop/2016/12/10/hadoop-cluster-installation-basic-version.html
https://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
https://www.dezyre.com/hadoop-tutorial/hadoop-multinode-cluster-setup
http://opensourceforu.com/2015/03/building-a-multi-node-hadoop-cluster-on-ubuntu/
https://acadgild.com/blog/integrating-apache-tez-with-hadoop/

4 comments:

  1. Nice post.
    Followed your instructions and worked perfectly. Please also post scripts for hadoop cluster. Keep up the good work you are doing.

    1. Sure. I am working on scripts for cluster.
      Thanks
      AK

  2. Hello AK,

    I was able to setup CentOS and Virtual box in my mac. The instructions provided are very useful and easy to understand. Thank you very much for the post and look forward more about hadoop.

    Thanks
    Babu

    1. Thanks Babu. I am glad that it worked for you.
      Will post more detailed write-ups about individual components.

