
According to the HDFS Architecture page HDFS was designed for "streaming data access". I'm not sure what that means exactly, but would guess it means an operation like seek is either disabled or has sub-optimal performance. Would this be correct?

I'm interested in using HDFS for storing audio/video files that need to be streamed to browser clients. Most of the streams will be start to finish, but some could have a high number of seeks.

Maybe there is another file system that could do this better?

Van Gale
  • Streaming Access Pattern in HDFS is: Write Once, Read Any Number Of Times, But Don't Try To Change The Contents Of The File. –  Jul 08 '15 at 17:37

3 Answers


HDFS stores data in large blocks -- like 64 MB. The idea is that you want your data laid out sequentially on your hard drive, reducing the number of seeks your hard drive has to do to read data.

In addition, HDFS is a user-space file system, so there is a single central name node that contains an in-memory directory of where all of the blocks (and their replicas) are stored across the cluster. Files are expected to be large (say 1 GB or more), and are split up into several blocks. In order to read a file, the code asks the name node for a list of blocks and then reads the blocks sequentially.
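A rough sketch of that read path (this is a toy model, not the real Hadoop API; the name-node directory and block IDs here are hypothetical):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, a common HDFS block size

# Hypothetical in-memory name-node directory: file path -> ordered block IDs.
# In real HDFS the name node also tracks which data nodes hold each replica.
name_node = {
    "/data/movie.mp4": ["blk_1", "blk_2", "blk_3"],
}

def blocks_for(path):
    """Ask the (simulated) name node for the block list; a client would
    then read those blocks sequentially from the data nodes."""
    return name_node[path]

# A 1 GB file splits into ceil(1 GB / 64 MB) = 16 blocks.
file_size = 1024 * 1024 * 1024
num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
print(blocks_for("/data/movie.mp4"))
print(num_blocks)  # 16
```

The large block size is the point: 16 mostly sequential reads for a gigabyte, rather than thousands of scattered ones.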

The data is "streamed" off the hard drive by maintaining the maximum I/O rate that the drive can sustain for these large blocks of data.


Streaming just implies that it can offer you a constant bitrate above a certain threshold when transferring the data, as opposed to having the data arrive in bursts or waves.

If HDFS is laid out for streaming, it will probably still support seek, with a bit of extra overhead to cache the data needed for a constant stream.

Of course, depending on system and network load, your seeks might take a bit longer.
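One way to see where that seek overhead comes from: with fixed-size blocks, a seek reduces to locating the right block and the offset within it. A minimal sketch (the block size and the cost remark are assumptions, not Hadoop internals):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assume 64 MB blocks

def locate(offset):
    """Map a byte offset in a file to (block index, offset within block).
    A seek landing in the block already being read is cheap; crossing into
    a different block means opening a stream to another data node."""
    return offset // BLOCK_SIZE, offset % BLOCK_SIZE

# Seeking to byte 200 MB lands in the 4th block (index 3), 8 MB in.
print(locate(200 * 1024 * 1024))
```

For a video player, most seeks that stay within the current 64 MB block cost little; jumps across blocks are where the extra latency shows up.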

towo

On streaming data, from Hadoop: The Definitive Guide, 3rd Edition:

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.