Hadoop for storing big data

Hadoop is an Apache Foundation project with the full formal name of Apache Hadoop. Apache Foundation projects are free and open source. Hadoop is aimed at managing millions of massive files for batch processing and streaming. While Hadoop can be a lot of work to set up, and there are good alternatives to everything Hadoop does, Hadoop has built up into a ready-made package of useful pieces that work together, giving you an easier system to maintain than a build-your-own mix.

Hadoop is not for frequent file updates or SQL/NoSQL style document/record updates. The Apache Cassandra project is a "NoSQL" database for record level updates of big data.

Hadoop 3.2.0, from January 2019, is the current release as I write. The download, from hadoop.apache.org, is a 346 MB .tar.gz file. With a file this size, the project could experiment with more modern download formats, including 7-Zip. The download expands out to 24,797 items consuming 869 MB on disk.

What it does

Hadoop supplies a file system oriented to many files, big files, redundancy, and parallel processing. As an alternative, you would have to set up one of several large scale file systems, or cloud storage, then build in RAID for local redundancy, then add replication across racks in your data centre, then add replication across data centres.

Hadoop includes the M & Ms: monitoring, measurement, and management software. They all work together and automatically work with the file system. You would have a hard time assembling that range of M & Ms from separate projects.

Hadoop supplies batch ETL (Extract, Transform, Load) software in the form of a MapReduce package. What you might lose by using Hadoop MapReduce, you gain back from the integration and automation. Hadoop can split a batch process into smaller components and run them in parallel. You could do this by hand, but then you have to manually adjust everything. Hadoop lets you specify a more general plan, with Hadoop components working out the details for multiple processes.

Think about a large sales system producing a huge transaction log every hour. Marketing asks for a list of the top 10,000 spenders over the last month so that marketing can bombard the poor customers with advertising. You have to step through the following data selection and refinement steps.

  1. Select the range of logs

  2. Extract the top spenders from each log

  3. Merge the results

  4. Extract the top spenders from the merged results

  5. Report the top spenders in a form useful to the mail/SMS/advert blasting software

When there is a lot of data, perhaps the top spenders for the last year, you might have to merge several times. Perhaps merge the results for a week then merge the results across multiple weeks.

Hadoop can be configured to select the right files for the process then create one extract process per log. The extract process runs on the computer storing the log. When there are many logs on the one computer, Hadoop can spread the workload over all the computers containing replications of the logs.
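
As a rough sketch, here is what the driver for that monthly extract might look like in Java. The class names, the log directory layout (one directory per day under /sales/logs), and the output path are my inventions for this example; only the Hadoop classes (Job, FileInputFormat, FileOutputFormat) are the real API. The Mapper and Reducer are sketched in the Hadoop MapReduce section below.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Hypothetical driver: select one month of hourly logs, run one map task per log split.
  public class TopSpenderDriver {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "top spenders 2019-01");
          job.setJarByClass(TopSpenderDriver.class);
          job.setMapperClass(SpendMapper.class);        // sketched in the Hadoop MapReduce section
          job.setReducerClass(SpendReducer.class);
          job.setOutputKeyClass(Text.class);            // customer id
          job.setOutputValueClass(LongWritable.class);  // spend in cents

          // A path glob selects the range of logs; Hadoop runs each map task near a DataNode holding that log.
          FileInputFormat.addInputPath(job, new Path("/sales/logs/2019-01-*/*.log"));
          FileOutputFormat.setOutputPath(job, new Path("/sales/reports/top-spenders-2019-01"));

          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }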

Hadoop can perform a lot of things automatically once you configure everything. You will have to check processing and increase file replication when there are bottlenecks. There is a Hadoop tool for that. When you use many of the Hadoop features, Hadoop is easier than fiddling with several separate tools.

Components

The download includes the Hadoop server, a client, HDFS, MapReduce, YARN, and a selection of tools.

Benchmarking

Hadoop includes tools for benchmarking, something you can do only after studying the way HDFS, MapReduce, and YARN work. You will also need a realistic number of machines. Your first tuning job might be to increase replication for popular data.

One simple consideration is the low cost of each computer. Low cost computers offer low performance. You need fast disks and fast network connections in each device. You need many, many devices to get performance, and you need massive replication for popular data. All that replication requires a fast network. Never cut corners on network speed.

Hadoop MapReduce

MapReduce has some similarities to parts of an SQL query, something most developers already understand. There are three steps: Map, Sort, and Reduce.

The Map step finds data in files. There is a separate search for data from each file and they run in parallel. This could be compared to the column list and the WHERE clause of an SQL query. The multiple requests could be compared to multiple subqueries in SQL.

The MapReduce software then sorts the data. You will, most likely, sort everything into the same sequence for a merge and match operation.

The Reduce part brings together the separate data components. This could be compared to a join in SQL.

The MapReduce request is actually written in Java, making the request unreadable for anyone outside the hard core developers. You could compare this to the primitive database requests used before the invention of SQL. Back then the most common problem was matching up the request invented by the developer with the request from the user. SQL let the user specify the request without fighting with developers.
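
To show what that Java looks like, here is a sketch of the Map and Reduce classes used by the driver sketched earlier, with the same assumptions: each log line is comma separated, the customer id is the first field, and the amount, in cents, is the third. The Sort step sits between the two classes and Hadoop performs it for you.

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map step: one instance per log split, run in parallel near the data.
  // Emits (customer id, amount spent) for every transaction line.
  class SpendMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
          String[] fields = line.toString().split(",");  // assumed layout: customer,timestamp,cents,...
          if (fields.length < 3) {
              return;                                    // skip malformed lines
          }
          context.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[2])));
      }
  }

  // Reduce step: receives every amount for one customer, already grouped by the Sort step.
  // Comparable to GROUP BY with SUM() in SQL; a second, smaller job would pick the top 10,000.
  class SpendReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text customer, Iterable<LongWritable> amounts, Context context)
              throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable amount : amounts) {
              total += amount.get();
          }
          context.write(customer, new LongWritable(total));
      }
  }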

HDFS

HDFS is the Hadoop Distributed File System. If you are not using more than one machine to test Hadoop, you cannot test the advantage of HDFS.

There are several "distributed" file systems. HDFS is designed for use on "commodity" hardware which means you could test across some old PCs you no longer use for other projects.

HDFS assumes there will be regular hardware failures due to the use of hundreds or thousands of computers. Recovery should be automatic, fast, complete, and reliable. This means you could run HDFS instead of using "cloud" storage, although a cloud storage option would protect both the data in HDFS and all your other data.

HDFS assumes you want data streaming or batch processing, not an interactive database. (The Apache Foundation offers the Cassandra database for big data in a database.) To achieve the high streaming and batch throughput, Hadoop drops some POSIX requirements, making HDFS not compatible with some applications and uses. You would use HDFS purely for Hadoop.

HDFS is designed to handle tens of millions of files with many files of multiple terabytes. The file access is designed for writing once then many reads. This is the opposite of what you need for databases and log files. Hadoop is more useful for a system where each update is a new version of a file. You could use Hadoop for document storage with each edit producing a new version and all old versions retained for "point in time" auditing.

A HDFS "cluster" has a master server controlling a set of slave servers. The master, the NameNode, controls allocation of files and data across the slaves, the DataNodes. Each file is replicated across more than one DataNode, similar to RAID. For a useful test configuration, you would need the minimum of four computers, one for the NameNode, at least two for DataNodes to allow replication, then at least one client computer.

Java

There is no Java included in the download. You have to install a compatible release of Java.

TCP/IP

Hadoop uses TCP/IP for communications between components. You have to configure your network to allow the right connections. You also have to configure your network to provide sufficient speed.

Every request to Hadoop starts with a request from the client to the NameNode to find the location of the data. The next part is a series of requests to multiple DataNodes to bring the data together. In the background the NameNode could be replicating data between DataNodes to recover from a failure, or moving data between DataNodes as part of load balancing.
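
Here is a sketch of the client side of that conversation. The only address the client configures is the NameNode; the client library opens the TCP connections to the DataNodes itself. The host name is my invention and 8020 is only a common NameNode RPC port, so check your own configuration.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReadFromCluster {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Point the client at the NameNode; DataNode addresses come back from the NameNode.
          conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

          try (FileSystem fs = FileSystem.get(conf);
               BufferedReader reader = new BufferedReader(new InputStreamReader(
                       fs.open(new Path("/sales/reports/top-spenders-2019-01/part-r-00000"))))) {
              System.out.println(reader.readLine());  // data streamed from a DataNode holding the block
          }
      }
  }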

YARN

Apache Hadoop YARN, Yet Another Resource Negotiator, is a fancy job scheduler based on a resource manager. This lets YARN spread jobs over many small computers. YARN runs a ResourceManager on one machine and talks with a NodeManager on each worker machine to track resource usage. Assuming the data is replicated several times across multiple machines, each job that accesses the data can run on a different machine.

A large file could be spread over many machines and a job component might only need the data in one machine. This makes job scheduling and load balancing more complex but can make it more efficient. MapReduce talks with YARN to request the minimum resources. For popular data and files, you might have to add extra machines and increase replication.
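
As a sketch of how a MapReduce job states its needs to YARN, the standard MapReduce properties set the memory and CPU for each task container. The numbers here are placeholders to tune, not recommendations.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class TunedJob {
      // Hypothetical helper: build a job whose containers ask YARN for specific memory and CPU.
      static Job buildJob() throws Exception {
          Configuration conf = new Configuration();
          conf.set("mapreduce.map.memory.mb", "2048");     // memory for each map container
          conf.set("mapreduce.reduce.memory.mb", "4096");  // memory for each reduce container
          conf.set("mapreduce.map.cpu.vcores", "1");       // virtual cores per map task
          conf.set("mapreduce.reduce.cpu.vcores", "2");    // virtual cores per reduce task
          return Job.getInstance(conf, "top spenders, tuned");
      }
  }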

Replication

Replication is the most important part of a Hadoop layout. The flexibility goes as far as letting you identify the rack housing each computer/DataNode. You can ensure the replication spreads over multiple racks to allow for a whole rack failing. I presume a data centre spread over multiple buildings could be configured to have at least one rack in each building. All you would need are some dedicated network connections, preferably fibre, to allow for all that data replication.
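
Replication is set cluster wide through dfs.replication and can be raised per file for popular data. A minimal sketch, with an invented path:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RaiseReplication {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.set("dfs.replication", "3");  // default number of copies for files this client creates

          try (FileSystem fs = FileSystem.get(conf)) {
              // Popular data can carry extra copies; the NameNode spreads them across racks.
              fs.setReplication(new Path("/sales/reports/top-spenders-2019-01/part-r-00000"), (short) 5);
          }
      }
  }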

Performance

Hadoop has so many performance options that you need a specialist dedicated to watching everything and working out the best configuration. You have options including storing data in memory to save disk activity.

The memory option writes and reads from RAM. In the background, the data is slowly written to disk. The disk writes could fail, resulting in a data loss. This would not matter in cases where the original data is already on disk.

As an example, you might maintain a data summary that is read frequently. If the data summary disappears, you can recreate the summary from detailed data or restart from an older version of the summary. Things like the "top 10 books read this week" could be lost and restarted at any point. The actual list of books with their reading statistics would be processed on disk. Recovery of the summary might take a few minutes but the occasional recreation could create less disk activity than constantly writing the summary to disk every time someone reads a book.
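
HDFS exposes that trade-off as a storage policy. The sketch below assumes an invented summary path and DataNodes already configured with a RAM disk; LAZY_PERSIST is the policy that writes to memory first and persists to disk in the background, which suits data you can afford to rebuild.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SummaryInMemory {
      public static void main(String[] args) throws Exception {
          try (FileSystem fs = FileSystem.get(new Configuration())) {
              // Write the summary to RAM on the DataNode, flush to disk in the background.
              // Acceptable here because the summary can be rebuilt from the detailed data.
              fs.setStoragePolicy(new Path("/sales/summaries/top-10-books-this-week"), "LAZY_PERSIST");
          }
      }
  }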

Users of Hadoop

Who uses Hadoop? Lots of big name American organisations are listed as users of Hadoop but there are no use cases. They could be using Hadoop on one test server in a corner of one project. This often happens when something is fashionable but too difficult to get beyond a test case.

I have not found a detailed explanation of a case where Hadoop is a better solution than something like Cassandra used with a regular large scale file system.

Uses for Hadoop

ETL, a modern name for an old batch processing approach, appears to be the only common use for Hadoop. Many ETL examples can be replaced by more modern processing. Cassandra and similar software have, collectively, far more users for this type of processing on newly created data. Hadoop wins where there are many old files not worth importing into databases.

ETL

ETL is Extract, Transform, Load, an ancient approach to batch processing data for long term analysis. Hadoop can be used to extract data from mainframes and perform ETL using MapReduce. There are many ways to perform ETL. I do not see the advantage of adding Hadoop outside of your primary data store, especially if it is a mainframe. Hadoop would work better if the old files were moved into Hadoop on a rack of low cost processors so you could sell your mainframe.

Load

Where do you load the data you extract? I suggest you load the data into a modern database for repeated analysis. When you use a big data oriented database, you could extract straight into the database then use the database for the transform, bypassing the ETL Load step because the data is already in a database.

Mainframes

Hadoop replaces mainframes with a mass of low cost computers. Before Hadoop started doing that, mainframes started converting to a mass of low cost processors. The main advantage of Hadoop is splitting up each request into many small steps and distributing the processing to the computers storing the data. This approach has existed forever. I introduced the approach at some mainframe sites some 20 years before Hadoop was invented.

Networks

Network speed was and is a limiting factor, for both Hadoop and all the predecessors. There is no mention of network monitoring or network load balancing in Hadoop, only disk and processor load balancing. You will spend a lot of time fixing your network to make Hadoop or any of the alternatives work. Today high speed fibre replaces the coax cable I had to fight back in the day.

Low cost commodity computers do not have fibre connections. You have to lump together a set of those computers in a rack then put an expensive switch in the rack and connect the racks with expensive fibre. You need a way to monitor the switches and the connections between racks. That type of network is not down in the commodity hardware range, although it is close. You might use an almost commodity switch to connect a few racks in a row then use some of the ultra expensive NotCommodity switches to connect the rows of racks.

You then need unbelievably expensive switches to connect your data centres.

Increase replication to decrease network usage. If you do the replication the right way, there will be less moving of data and workloads across the network. The next network usage bill might not make your eyes burst out of your head.

SQL and NoSQL based alternatives

You need masses of files to justify the setup costs for Hadoop. Hadoop can be mostly replaced by a regular file system in cloud based processing with an SQL database for managing file versioning and one of the large scale search packages. The Hadoop data and processing management components start to save you time when the volume of data overwhelms your staff.

When you have a specific requirement that is difficult in an SQL database, look at Apache Cassandra and similar NoSQL products. Cassandra could be considered a step away from SQL toward Hadoop with lots of shared configuration requirements but none of the restrictions on record level processing.

There are lots of configuration similarities between Cassandra, NoSQL, and Hadoop. The differences start with the documents/files. Hadoop stores files with no restriction on content. The Hadoop MapReduce processing favours things like log files. NoSQL is better for structured documents with each NoSQL package favouring a specific document format. As an example, MongoDB uses a JSON style document format and that makes MongoDB a favourite among people who need to publish extracts of documents online via a Javascript framework.

SQL and NoSQL databases can spread masses of data over many low cost commodity computers using sharding and similar approaches. This is similar to Hadoop spreading files out for replications and distributed processing. Databases are popular for new data. Loading masses of old data into databases is not popular and is often the starting point for using Hadoop.

Whichever approach you use, you need to understand your data before you choose an approach. As an example, you might leave all your old data in Hadoop and put all your new data in a MariaDB cluster. You might then decide to backload a year of data from Hadoop into MariaDB so your customers can look at previous clothing purchases from this season last year. Decisions like that are based on the decreasing cost of database storage and the need to lower people costs and increase service by letting customers manage their accounts.

If I was running an online shop, I would start with the database first, backload everything for the customers, and limit Hadoop to data that customers cannot use.

Summary

Hadoop sounds like a solution looking for a problem. There are better ways to handle large numbers of large files for many types of applications. Hadoop kicks in somewhere between "large" and "mega". Hadoop has almost everything you need bundled in the download, saving you time compared to assembling your own mix of software tools. There is a huge configuration workload up front then you should save time through the Hadoop integration and automation. Eventually you should have a system that will expand to handle millions of massive files with no dramatic increase in management staff. Look at Apache Hadoop and Apache Cassandra/NoSQL then choose the right software for each application.