Working with Apache Hadoop for big data processing

Apache Hadoop is an open-source framework for the distributed storage and processing of large datasets. It is widely used for big data workloads by organizations of every size, from small teams to large enterprises, because it can store and process amounts of data that would not fit on a single machine.

In this tutorial, you will learn the basics of Apache Hadoop: its architecture, how to install and configure it on a single machine or a cluster, and how to process data with MapReduce.

Apache Hadoop Architecture

Apache Hadoop comprises two primary components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a distributed file system that provides fault-tolerant, high-throughput access to data and supports large-scale data processing. MapReduce is a distributed data processing system that runs computations in parallel across numerous nodes in a cluster.

HDFS consists of NameNodes and DataNodes. The NameNode tracks the location of data stored in a cluster and manages block replication, while DataNodes are responsible for storing data and handling client read and write requests.

MapReduce, on the other hand, splits the input data into smaller segments and distributes them to worker nodes in the cluster. Each worker node processes its segments independently, and the intermediate results are combined to produce the final output.
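
Once a cluster is running (for example, the single-node setup built later in this tutorial), you can see both sides of this architecture from the command line. The commands below are standard HDFS tools, shown here purely as an illustration:

$ hdfs dfsadmin -report        # DataNodes registered with the NameNode, capacity, replication
$ hdfs fsck / -files -blocks   # how files are split into blocks across DataNodes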

Installing and Configuring Hadoop

Before you can start working with Hadoop, you must first install and configure it on your local machine or cluster.

Pre-requisites

Here are the pre-requisites for installing and configuring Hadoop (a quick way to verify them is shown after the list):

  • Java 8 or higher must be installed on your machine.
  • A Unix-like operating system or Linux distribution (e.g., Ubuntu, Red Hat, Debian, CentOS).
  • At least 4 GB of memory (RAM) on your machine.
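
A quick way to verify these prerequisites on a Linux machine (exact output varies by system):

$ java -version    # should report version 1.8 or newer
$ uname -s         # should print Linux or another Unix-like OS name
$ free -h          # total memory should be at least 4 GB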

Installing Hadoop

To install Hadoop, follow these steps:

  1. First, navigate to the Hadoop download page and download the latest release.
  2. Unpack the downloaded tar file to your desired location using the following command:
    $ tar -xzf hadoop-X.Y.Z.tar.gz
    

    Replace X.Y.Z with the Hadoop version you downloaded.

  3. Configure Hadoop by editing etc/hadoop/hadoop-env.sh in the unpacked directory to point at your Java installation. Add or update the following line in that file (the configuration files in the remaining steps live in the same etc/hadoop directory):

    export JAVA_HOME=<path_to_java_directory>
    
  4. Next, open core-site.xml and add the following configuration:
    <configuration>
       <property>
           <name>fs.defaultFS</name>
           <value>hdfs://localhost:9000</value>
       </property>
    </configuration>
    

    This configuration sets the default file system to HDFS and specifies the NameNode host and port.

  5. Open hdfs-site.xml and add the following configuration:

    <configuration>
       <property>
           <name>dfs.replication</name>
           <value>1</value>
       </property>
       <property>
           <name>dfs.namenode.name.dir</name>
           <value>/hadoop/namenode</value>
       </property>
       <property>
           <name>dfs.datanode.data.dir</name>
           <value>/hadoop/datanode</value>
       </property>
    </configuration>
    

    The dfs.replication configuration specifies the replication factor for data blocks, while dfs.namenode.name.dir and dfs.datanode.data.dir specify the location of NameNode and DataNode directories, respectively.

  6. Finally, open mapred-site.xml and add the following configuration:

    <configuration>
       <property>
           <name>mapreduce.framework.name</name>
           <value>yarn</value>
       </property>
    </configuration>
    

    This configuration tells MapReduce to run on YARN; a related yarn-site.xml sketch follows below.
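
Because the job will run on YARN, single-node setups typically also add a minimal yarn-site.xml in the same etc/hadoop directory. This is a common addition rather than one of the steps above; the shuffle service is what lets reducers fetch map output:

<configuration>
   <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
   </property>
</configuration>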

Starting and Stopping Hadoop Services
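
Before starting HDFS for the first time, format the NameNode storage directory. This is a one-time step and erases any existing HDFS metadata:

$ hdfs namenode -format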

To start the Hadoop services, run the following command from the sbin directory of your Hadoop installation:

$ ./start-all.sh

This command starts the NameNode, DataNode, and the YARN ResourceManager and NodeManager. (In recent Hadoop releases, start-all.sh is deprecated; running start-dfs.sh followed by start-yarn.sh achieves the same result.)
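
To confirm that the daemons are actually running, you can use the jps tool that ships with the JDK:

$ jps

If everything started correctly, the list should include processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.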

To stop the Hadoop services, run the following command from the same directory:

$ ./stop-all.sh

This command stops all currently running Hadoop services.

Processing Data with Hadoop

After installing and configuring Hadoop, you can then start processing data. Here, we will use the classic word count example to demonstrate how to process data using Hadoop MapReduce.

Creating an Input File

First, create an input file containing data to process. To do this, create a text file named input.txt and add some text. For example:

Hello world, this is a test file.
This is a line of text in the input file.
This is another line of text in the input file.

Uploading Input File to HDFS

After creating an input file, the next step is to upload it to HDFS. To do this, run the following command:

$ hdfs dfs -put /path/to/input.txt /input

This command copies the input file from your local file system into the /input directory in HDFS.
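
If you want /input to be a directory (which is how the rest of this tutorial treats it), create it before running the -put command above, and list it afterwards to confirm the upload:

$ hdfs dfs -mkdir -p /input
$ hdfs dfs -ls /input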

Writing a Mapper

The mapper is responsible for processing input data and producing key-value pairs as output. Here is a sample mapper implementation:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reusable output objects, so we do not allocate new ones for every token.
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emit a (word, 1) pair for every whitespace-separated token on the line.
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

For each line of input, the framework calls map with the line's byte offset as the key and the line text as the value; the mapper tokenizes the line and writes a (word, 1) pair to the context for every token.

Writing a Reducer

The reducer receives key-value pairs produced by the mapper and combines them to produce final output. Here is a sample reducer implementation:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable sum = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Add up every count emitted for this word by the mappers.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        sum.set(count);
        context.write(key, sum);
    }
}

The reducer receives each unique word as its key, together with all of the counts emitted for that word by the mappers. It iterates over those values, sums them, and writes the word and its total count as the final output.
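
The command in the next section assumes a third class: a driver, named WordCount, that wires the mapper and reducer together and submits the job. The tutorial does not list it, so here is a minimal sketch built on the standard MapReduce Job API (the combiner line is optional local aggregation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // Create the job and tell Hadoop which jar contains the job classes.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        // Wire in the mapper and reducer written above.
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional: pre-aggregate on the map side
        job.setReducerClass(WordCountReducer.class);

        // Declare the output key/value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and exit with a non-zero code on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}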

Running MapReduce Job

After writing the mapper, reducer, and driver classes, compile them and package them into a JAR file. You can then run the MapReduce job on the input data with a command like the following:

$ hadoop jar wordcount.jar WordCount /input /output

Here, wordcount.jar is an example name for the JAR containing your compiled classes, WordCount is the driver class that defines the job, /input is the input data directory in HDFS, and /output is the directory where the job writes its results. The output directory must not already exist; Hadoop fails the job rather than overwrite it.

After executing the above command, Hadoop executes the MapReduce job. Once the job completes successfully, the output directory will contain a file named part-r-00000, which contains the results of the MapReduce job.
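
You can list the output directory to check this; the empty _SUCCESS marker file indicates that the job finished without errors:

$ hdfs dfs -ls /output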

Viewing Results

To view the result of the MapReduce job, merge the output files into a single local file with the following command:

$ hdfs dfs -getmerge /output/* /path/to/output.txt

Then, open output.txt and confirm that it contains the expected results:

Hello   1
This    2
a       2
another 1
file.   3
in      2
input   2
is      3
line    2
of      2
test    1
text    2
the     2
this    1
world,  1
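
Alternatively, you can print the results directly from HDFS without copying them to the local file system:

$ hdfs dfs -cat /output/part-r-*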

Conclusion

Apache Hadoop is a widely used distributed data processing framework for big data applications. In this tutorial, we learned about the components of Hadoop and how to install and configure it on a local machine. We also learned how to process data using Hadoop MapReduce and how to view the results.

With this knowledge, you can start developing applications that can process vast amounts of data using Hadoop.
