Abstract:
Cloud computing provides users with a variety of services for processing data. One of its central concepts is BigData and BigData analysis. BigData refers to complex, unstructured, or very large data. Hadoop is an environment used to process BigData in a parallel processing mode. The idea behind Hadoop is that, rather than sending data to servers for processing, Hadoop divides a job into small tasks and sends them to the servers that already hold the data; these servers process the tasks and send the results back to the master node. Hadoop has limitations that, if addressed, could yield higher performance in executing jobs. These limitations stem mostly from data locality in the cluster, job and task scheduling, CPU execution time, and resource allocation. Data locality and efficient resource allocation remain a challenge in the cloud computing MapReduce platform. We propose an enhanced Hadoop architecture that reduces the computation cost associated with BigData analysis and, at the same time, addresses the issue of resource allocation in native Hadoop. The proposed architecture provides an efficient distributed clustering approach for dedicated cloud computing environments. The enhanced Hadoop architecture leverages the NameNode's ability to assign jobs to the TaskTrackers (DataNodes) within the cluster: by adding controlling features to the NameNode, it can intelligently direct and assign tasks to the DataNodes that contain the required data. Our focus is on extracting features and building a metadata table that records the existence and location of the data blocks in the cluster. This enables the NameNode to direct jobs to specific DataNodes without scanning the whole data set in the cluster. It should be noted that the newly built lookup table is an addition to the metadata table that already exists in native Hadoop.
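The feature-to-block lookup table described above can be illustrated with a minimal sketch. This is not the paper's implementation; all class and method names below are hypothetical, and it only shows the idea of mapping extracted features to the DataNodes and blocks that contain them, so the NameNode can route a job without scanning every node.

```python
from collections import defaultdict

class FeatureLookupTable:
    """Hypothetical sketch of the extra metadata table kept by the
    enhanced NameNode: feature -> set of (datanode_id, block_id)."""

    def __init__(self):
        self._index = defaultdict(set)

    def register_block(self, datanode_id, block_id, features):
        # Record which extracted features appear in a stored block.
        for feature in features:
            self._index[feature].add((datanode_id, block_id))

    def locate(self, feature):
        # Return only the nodes/blocks holding the feature, so tasks
        # are sent to relevant DataNodes instead of the whole cluster.
        return sorted(self._index.get(feature, set()))

# Example: three blocks spread over two DataNodes
table = FeatureLookupTable()
table.register_block("dn1", "blk_001", {"ACGT", "TTGA"})
table.register_block("dn2", "blk_007", {"ACGT"})
table.register_block("dn2", "blk_008", {"GGCC"})

print(table.locate("ACGT"))  # [('dn1', 'blk_001'), ('dn2', 'blk_007')]
```

With such an index, a query touching only "GGCC" would be dispatched to a single DataNode, which is the intuition behind the reduced read operations and input size reported below.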
Our work targets the processing of real text data sets, whether human-readable, such as books, or not human-readable, such as DNA data sets. To test the performance of the proposed architecture, we perform DNA sequence matching and alignment of various short genome sequences. Compared with native Hadoop, the proposed architecture reduced CPU time, the number of read operations, the input data size, and other factors.
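As a rough illustration of the per-block work in the DNA experiment, the sketch below performs exact matching of a short query sequence against a genome fragment. This is an assumed, simplified stand-in for the paper's matching and alignment task, not its actual code.

```python
def match_positions(genome, query):
    """Return every 0-based start position where query occurs in genome.
    Overlapping occurrences are included."""
    positions = []
    start = genome.find(query)
    while start != -1:
        positions.append(start)
        start = genome.find(query, start + 1)  # continue past this hit
    return positions

# Example: locate a short sequence inside a small genome fragment
genome = "ACGTACGTTGACGT"
print(match_positions(genome, "ACGT"))  # [0, 4, 10]
```

In the proposed architecture, a task like this would run only on the DataNodes whose blocks are known (via the lookup table) to contain the query sequence.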