Engineering Full Stack Apps with Java and JavaScript
Although the title says interview questions, this page lists questions that can also be used to test your understanding of the basics of Big Data and Hadoop, and of the component technologies that make up the Hadoop technology stack. It does not go deep into any single component of the stack. Seeing the bigger picture and knowing how the components fit together will help you choose the right component and use it in the right way.
The questions come without answers, but most have a hint. Interviewers can use the hints to give additional input to the candidate, and candidates can use them to get a quick idea about the question, such as where a component fits in the overall picture.
BigData and Hadoop Basics
What is Big Data?
Hint: Broad term for data sets so large or complex that traditional data processing applications are inadequate.
What is Hadoop?
Hint: HDFS + MapReduce + Libraries (HBase, Pig etc.).
How is Google File System (GFS) related to Hadoop?
Hint: Hadoop's file system (HDFS) was originally based on Google's white paper on GFS.
Why Hadoop?
Hint: Cheaper (commodity hardware), Faster (parallel processing).
List a few use cases where Hadoop can be used.
Hint: Risk modeling, recommendation engine, Ad targeting, search engine quality.
What are the core Hadoop components?
Hint: HDFS and MapReduce.
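To make the MapReduce half of this hint more concrete, here is a minimal in-memory sketch of the map, shuffle and reduce phases for a word count. It is plain Python, not the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for illustration. In a real cluster the input would come from HDFS and each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

A candidate could be asked which of these steps the programmer writes (map and reduce) and which the framework handles (splitting the input, shuffling and sorting by key).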
What are the differences between Hadoop 1.x and Hadoop 2.x?
Hint: YARN
What are the differences between the RDBMS and Hadoop ways of treating data?
Hint: RDBMS = schema on write, Hadoop = schema on read.
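The contrast between schema on write and schema on read can be sketched in a few lines. This is only an illustration under stated assumptions: the SCHEMA dictionary, the field names and both functions are hypothetical, and a Python list and a list of JSON strings stand in for a database table and for raw files in HDFS.

```python
import json

# Hypothetical schema used only for this illustration.
SCHEMA = {"user": str, "amount": float}


def write_with_schema(table, record):
    """Schema on write: validate before storing and reject bad records."""
    for field, field_type in SCHEMA.items():
        if not isinstance(record.get(field), field_type):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema on read: raw text is stored as-is and structure is applied when reading."""
    for line in raw_lines:
        data = json.loads(line)
        yield {
            "user": str(data.get("user", "")),
            "amount": float(data.get("amount", 0)),
        }


if __name__ == "__main__":
    table = []
    write_with_schema(table, {"user": "a", "amount": 1.5})
    raw = ['{"user": "a", "amount": "2"}', '{"user": "b"}']
    print(table)
    print(list(read_with_schema(raw)))
```

In the first function, bad data never reaches storage. In the second, anything can be stored, and each reader decides how to interpret it, which is cheaper at load time but moves the validation cost to every query.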
What are the disadvantages of using traditional relational databases for data analytics?
Hint: Scalability, Speed etc.
Many people compare RDBMS with Hadoop. Is Hadoop a database?
Hint: Hadoop is a distributed file system with a processing framework, not a database. It may be used along with a database (mostly a NoSQL database).
Will RDBMS still be useful with the popularity of Hadoop?
Hint: They solve a different problem.
What are NoSQL Databases?
Hint: Databases in which data is modeled by means other than the tabular relations used in relational databases. No fixed columns.
List a few types of NoSQL databases with examples.
Hint: Key/value, column store, document store etc.
What is a wide column store NoSQL Database?
Hint: The number and names of columns can vary from row to row. E.g. HBase.
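A toy model may help to picture a wide column store: each row key maps to its own set of columns, so two rows in the same table need not share the same columns. The sketch below is not the HBase API. The class WideColumnTable, its methods and the "family:qualifier" column names are assumptions made for illustration.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy model: row key -> "family:qualifier" column name -> value."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; the row gains this column if it did not have it."""
        self.rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read one cell, or return the default if the row lacks the column."""
        return self.rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """List the columns that exist for this particular row."""
        return sorted(self.rows.get(row_key, {}))


if __name__ == "__main__":
    table = WideColumnTable()
    table.put("u1", "info:name", "Ann")
    table.put("u1", "info:email", "ann@example.com")
    table.put("u2", "info:name", "Bob")
    print(table.columns("u1"))
    print(table.columns("u2"))
```

Here row u1 has two columns and row u2 has one, and no storage is reserved for the column that u2 lacks. That sparseness is what the hint refers to.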
What do you know about CAP theorem?
Hint: Consistency, Availability, Partition tolerance.
How and where does Hadoop fit in the CAP theorem?
Hint: Scalability (Partitioning), Flexibility (Availability).
What kinds of data are a good fit for Hadoop?
Hint: Behavioral Data.
What kinds of data are not a good fit for Hadoop?
Hint: Transactional data.
Mostly Hadoop is used along with NoSQL databases. Can Hadoop be used with RDBMS? Explain.
What are Hadoop’s alternative products or solutions?
Hint: Disco, Filemap, Zillabyte etc.
What are the different distributions of Hadoop available?
Hint: Open source (Apache Hadoop), commercial (Cloudera, Hortonworks, MapR), cloud (AWS with open source or commercial Hadoop, Windows Azure HDInsight).
What are the different Hadoop solutions available from Cloudera?
Hint: Cloudera Enterprise, Cloudera Live etc.
What is Hue?
Hint: Web-based GUI for Hadoop (Hadoop User Experience), open source and bundled with Cloudera distributions.
What do you know about Hadoop solutions available from Hortonworks?
Hint: Windows and Linux versions, VMs with installations.
What do you know about Hadoop solutions available from MapR?
Hint: NoSQL-DB file system, add-ons to Apache projects, sandboxes.
What do you know about cloud initiatives based on Hadoop?
Hint: Amazon Elastic MapReduce (EMR), Microsoft Azure HDInsight.
What do you mean by Hadoop incubator projects? Can you list any of them?
How do you compare Hadoop data processing with Grid Computing?
Hadoop Technology Stack
What do you know about the below components (or libraries) and how are they related to Hadoop?
HDFS
Hint: Hadoop Distributed File System, part of Hadoop core.
MapReduce
Hint: Programming model for processing data in Hadoop, part of Hadoop core.
YARN
Hint: Stands for Yet Another Resource Negotiator, also known as MapReduce v2, part of Hadoop core in Hadoop v2. It is a cluster resource management system that allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.
HBase
Hint: A key-value store, wide columnstore, NoSQL, uses HDFS for its underlying storage.
Hive
Hint: HiveQL (HQL), a SQL-like query language for data stored in Hadoop (HDFS, and also HBase tables).
Pig
Hint: Scripting language (Pig Latin) for data flows.
Mahout
Hint: Machine learning, predictive analysis.
Oozie
Hint: Workflow, coordination of jobs.
ZooKeeper
Hint: Coordination
Sqoop
Hint: Data Exchange (RDBMS)
Flume
Hint: Log Collector.
Ambari
Hint: Managing Hadoop Clusters
Cassandra
Drill
Spark
Shark
HCatalog
Lucene
Hama
Crunch
Avro
Thrift
Chukwa
What are the differences between MapReduce 1 and YARN?
What are the GUI tools available for managing Hadoop HDFS, MapReduce and/or YARN? Have you used any?
Can you run Hadoop (MapReduce) on a regular file system without HDFS?
Hint: Standalone (local) mode.
Can you run Hadoop (MapReduce) on a cloud file system without HDFS?
Hint: Amazon S3, Azure Blob Storage.
Wikipedia pages for all products listed here (if available).
CBT Nuggets Apache Hadoop
Lynda.com Hadoop Fundamentals
Visit http://javajee.com/bigdata-and-hadoop-course-plan.
Fill the following form selecting course as Big Data and Hadoop: http://javajee.com/content/volunteer-learning-program.