1. HDFS command
2. Pig LOAD command
3. Sqoop import
4. Hive LOAD DATA command
5. Ingest with Flume agents
Ans: 3
Exp: Apache Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs. We use Pig scripts to sift through the data and extract useful information from the Web logs. We load the log file into Pig using the LOAD command:
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
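For illustration only, a minimal sketch of how the loaded lines could then be parsed and summarized; the regular expression, the aliases (parsed, by_host, host_counts) and the output path are assumptions, not part of the original example.
-- Sketch only: regex, aliases and output path are assumptions.
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
-- Split each Common Log Format line into host, timestamp, request and status.
parsed = FOREACH raw_logs GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line,
        '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "(.+?)" (\\d{3}) \\S+$'))
    AS (host:chararray, ts:chararray, request:chararray, status:chararray);
-- Count requests per host.
by_host = GROUP parsed BY host;
host_counts = FOREACH by_host GENERATE group AS host, COUNT(parsed) AS hits;
STORE host_counts INTO 'output/host_counts';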
Note 1:
Data Flow and Components
*Content is created by multiple Web servers and logged on local hard disks. This content is then pushed to HDFS using the Flume framework. Flume has agents running on the Web servers; these machines collect the data through intermediate collectors and finally push it to HDFS.
*Pig scripts are scheduled to run using a job scheduler (which could be cron or any more sophisticated batch-job solution). These scripts analyze the logs along various dimensions and extract the results. Results from Pig are inserted into HDFS by default, but storage implementations for other repositories such as HBase, MongoDB, etc. can also be used. We have also tried the solution with HBase (please see the implementation section). Pig scripts can either push the data to HDFS, after which MR jobs are required to read it and push it into HBase, or they can push the data into HBase directly; a sketch of both options follows this list. In this article, we use scripts to push data onto HDFS, as we are showcasing the applicability of the Pig framework for log analysis at large scale.
*The HBase database holds the data processed by the Pig scripts, ready for reporting and further slicing and dicing.
*The data-access Web service is a REST-based service that eases access and integration with data clients. A client written in any language can access the REST-based API; these clients could be BI- or UI-based clients.
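As a sketch of the two storage options above, assuming a relation host_counts(host, hits) such as the one produced in the earlier sketch; the HDFS path, the HBase table 'log_metrics' and the column family 'stats' are hypothetical names used only for illustration.
-- Sketch only: paths, table and column names are assumptions.
-- Option 1: store the results on HDFS.
STORE host_counts INTO '/logs/analysis/host_counts' USING PigStorage('\t');
-- Option 2: push the results directly into HBase; the first field of each
-- tuple (host) becomes the row key, the remaining field maps to stats:hits.
STORE host_counts INTO 'hbase://log_metrics'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:hits');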
Note 2:
The Log Analysis Software Stack
*Hadoop is an open source framework that allows users to process very large data sets in parallel. It's based on the framework that supports the Google search engine. The Hadoop core is mainly divided into two modules:
1. HDFS, the Hadoop Distributed File System, which allows you to store large amounts of data using multiple commodity servers connected in a cluster.
2. Map-Reduce (MR), a framework for parallel processing of large data sets. The default implementation is bundled with HDFS.
*The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, since we can keep processed historical data for reporting purposes. HBase is an open source columnar (NoSQL) database that uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets -- HBase can store very large tables with millions of rows. It's a distributed database and can also keep multiple versions of a single row.
*The Pig framework is an open source platform for analyzing large data sets, implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who would otherwise write code in the Map-Reduce format, which has to be written in Java. In contrast, Pig enables users to write code in a scripting language; a short sketch follows this list.
*Flume is a distributed, reliable and available service for collecting, aggregating and moving large amounts of log data (src: flume-wiki). It was built to push large logs into Hadoop-HDFS for further processing. It's a data-flow solution, where each node has an originator and a destination, and it is divided into Agent and Collector tiers for collecting logs and pushing them to destination storage.
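To illustrate the Pig point above: the classic word-count job, which would otherwise require a full Java Map-Reduce program, fits in a few lines of Pig Latin. The input and output paths below are placeholders.
-- Sketch only: input and output paths are placeholders.
lines  = LOAD 'input/sample.log' USING TextLoader AS (line:chararray);
-- Break each line into words and count the occurrences of each word.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'output/word_counts';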