Introduction: Advanced Big Data Hadoop
Drive better business decisions with an overview of how big data is organized, analyzed, and interpreted. Apply your insights to real-world problems and questions.
Do you need to understand big data and how it will impact your business? This Specialization is for you. You will gain an understanding of what insights big data can provide through hands-on experience with the tools and systems used by big data scientists and engineers. Previous programming experience is not required! You will be guided through the basics of using Hadoop with MapReduce, Spark, Pig, and Hive. By following along with the provided code, you will experience how one can perform predictive modeling and leverage graph analytics to model problems. This specialization will prepare you to ask the right questions about data, communicate effectively with data scientists, and do basic exploration of large, complex datasets.
1. Introduction to Hadoop and Big Data
• Introduction to Big Data
• Introduction to Hadoop
• Business problems / challenges with Big Data
• Scenarios where Hadoop is used
• Overview of batch processing and real-time data analytics using Hadoop
• Hadoop vendors – Apache, Cloudera, Hortonworks
• Hadoop versions – Hadoop 1.x and Hadoop 2.x
• Hadoop services – HDFS, MapReduce, YARN
• Introduction to Hadoop ecosystem components (Hive, HBase, Pig, Sqoop, Flume, ZooKeeper, Kafka, Spark)
2. Cluster setup (Hadoop 1.x)
• Linux VM installation on the system for the Hadoop cluster using Oracle VirtualBox
• Preparing nodes for Hadoop and VM settings
• Install Java and configure passwordless SSH across nodes
• Basic Linux commands
• Hadoop 1.x single-node deployment
• Hadoop daemons – NameNode, JobTracker, DataNode, TaskTracker, Secondary NameNode
• Hadoop configuration files and running
• Important Web URLs and logs for Hadoop
• Run HDFS and Linux commands
• Hadoop 1.x multi-node deployment
• Run sample jobs in Hadoop single-node and multi-node clusters
3. HDFS Concepts
• HDFS Design Goals
• Understand blocks and how to configure block size
• Block replication and replication factor
• Understand Hadoop Rack Awareness and configure racks in Hadoop
• File read and write anatomy in HDFS
• Health monitoring using the fsck command
• Understand NameNode Safemode, file system image (FsImage), and edits
• Configure Secondary NameNode and use the checkpointing process to support NameNode recovery
• HDFS dfsadmin and file system shell commands
• Hadoop NameNode / DataNode directory structure
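
As a rough illustration of the block size and replication factor topics in this module, the sketch below estimates how many blocks a file is split into and how much raw disk space its replicas consume. The 128 MB block size and replication factor of 3 are assumptions chosen for the example (they match the usual Hadoop 2.x defaults; Hadoop 1.x defaulted to 64 MB blocks), and the function name is hypothetical, not part of any Hadoop API.

```python
MB = 1024 * 1024


def hdfs_storage_estimate(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across all replicas).

    Assumed defaults: 128 MB blocks, replication factor 3.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("invalid size, block size, or replication factor")
    # Integer ceiling division: the last block may be only partly filled.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # A partial last block only occupies its actual length on disk,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes


if __name__ == "__main__":
    blocks, raw = hdfs_storage_estimate(1024 * MB)
    print(f"1 GB file: {blocks} blocks, {raw // MB} MB of raw storage")
```

Changing `block_size_bytes` to `64 * MB` shows how the same file produces twice as many blocks, which is the trade-off the block-size topic covers.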
4. MapReduce Concepts
• Introduction to MapReduce
• MapReduce Architecture
• Understanding the concept of Mappers & Reducers
• Anatomy of a MapReduce program
• Phases of a MapReduce program
• Data types in Hadoop MapReduce
• Driver, Mapper, and Reducer classes
• InputSplit and RecordReader
• InputFormat and OutputFormat in Hadoop
• Concepts of Combiner and Partitioner
• Running and monitoring MapReduce jobs
• Writing your own MapReduce job using the MapReduce API
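
The map, partition, shuffle, and reduce phases listed in this module can be sketched in plain Python with a word count. This is an in-memory illustration only, not the Hadoop MapReduce API: the function names, the byte-sum partitioner, and the two-reducer default are all assumptions made for the example.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def partition(key, num_reducers):
    """Partitioner: pick a reducer for a key (simple deterministic hash, assumed)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Reduce phase: sum all counts collected for one key."""
    return key, sum(values)


def run_word_count(lines, num_reducers=2):
    # Shuffle: route each mapped pair to its partition and group values by key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in mapper(line):
            partitions[partition(key, num_reducers)][key].append(value)
    # Each partition is reduced independently, with keys processed in sorted order.
    result = {}
    for part in partitions:
        for key in sorted(part):
            word, total = reducer(key, part[key])
            result[word] = total
    return result


if __name__ == "__main__":
    print(run_word_count(["big data big", "data hadoop"]))
```

In a real job the same roles are filled by the Driver, Mapper, and Reducer classes, with the framework handling the shuffle across nodes.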
5. Cluster setup (Hadoop 2.x)
• Hadoop 1.x limitations
• Design goals for Hadoop 2.x
• Introduction to Hadoop 2.x
• Introduction to YARN
• Components of YARN – ResourceManager, NodeManager, ApplicationMaster
• Deprecated properties
• Hadoop 2.x single-node deployment
• Hadoop 2.x multi-node deployment
6. HDFS High Availability and Federation
• Introduction to HDFS Federation
• Understand Nameservice ID and block pools
• Introduction to HDFS High Availability
• Failover mechanisms in Hadoop 1.x
• Concept of Active and Standby NameNode
• Configuring Journal Nodes and avoiding a split-brain scenario
• HDFS haadmin commands
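
The JournalNode and split-brain topics above rest on majority quorums: an edit counts as committed only when more than half of the JournalNodes acknowledge it, so two NameNodes cannot both gather a majority at the same time. The arithmetic can be sketched as follows; the function names are hypothetical and the code models only the counting, not the actual quorum journal protocol.

```python
def quorum_size(num_nodes):
    """Smallest number of nodes that forms a strict majority."""
    if num_nodes < 1:
        raise ValueError("need at least one node")
    return num_nodes // 2 + 1


def tolerated_failures(num_nodes):
    """How many nodes can fail while a majority is still reachable."""
    return num_nodes - quorum_size(num_nodes)


def can_commit(num_nodes, acks):
    """True if the number of acknowledgements reaches a majority."""
    return acks >= quorum_size(num_nodes)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "nodes: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

The loop shows why odd-sized ensembles are the usual choice: four nodes tolerate no more failures than three.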
7. YARN – Yet Another Resource Negotiator
• YARN Architecture
• YARN Components – ResourceManager, NodeManager, JobHistoryServer, Application Timeline Server, MRApplicationMaster
• YARN application execution flow
• Running and monitoring YARN applications
8. Apache ZooKeeper
• Introduction to Apache ZooKeeper
• ZooKeeper stand-alone installation
• ZooKeeper clustered installation
• Understand znodes and ephemeral nodes
• Manage znodes using the Java API
• ZooKeeper four-letter word commands
9. Apache Hive
• Introduction to Hive
• Hive Architecture
• Components – Metastore, HiveServer2, Beeline, Hive CLI, Hive Web Interface
• Installation and configuration
• Metastore service
• DDLs and DMLs
• SQL – Select, Filter, Join, Group By
• Partitions and buckets in Hive
• Install and configure HCatalog services
10. Apache Pig
• Introduction to Pig
• Pig installation
• Accessing the Pig Grunt shell
• Pig data types
• Pig commands
• Pig relational operators
11. Apache Sqoop
• Introduction to Sqoop
• Sqoop Architecture and Installation
• Import data into HDFS using Sqoop
• Import all tables using Sqoop
• Import tables directly into Hive
• Export data from HDFS
12. Apache Flume
• Introduction to Flume
• Flume Architecture and Installation
• Define Flume agent – Sink, Source and Channel
• Flume Use Cases
13. Apache HBase
• Introduction to HBase
• HBase Architecture
• HBase components – HBase Master and RegionServers
• HBase installation and configurations
• Create sample tables and queries on HBase
14. Apache Spark / Storm / Kafka
• Real-time data analytics
• Introduction to Spark / Storm / Kafka
15. Cluster Monitoring and Management tools
• Cloudera Manager
• HUE
Projects:
• Pokémon Data Analysis
• Flight data analysis
• Sales data analysis
• Stock Data analysis