Hadoop DP Notes: Why can't we just keep the file in HDFS and have the application read it, instead of using the distributed cache?

The distributed cache copies the file from HDFS to the local disk of each node manager that runs tasks for the job, once, before those tasks start. Whether that node manager then runs 10 or 50 map or reduce tasks, they all read the same local copy.
If the application reads the file directly from HDFS instead, every map or reduce task opens it separately, so a node manager that runs 100 map tasks reads the file from HDFS 100 times. Reading the file from the node manager's local file system is much faster than fetching it from HDFS data nodes.
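
The difference can be put into numbers with a small counting model. The sketch below is only an illustration, not Hadoop code: the class name CacheReadModel, its two methods, and the task counts per node are all hypothetical, and it assumes one HDFS read per task without the cache versus one copy per node with it.

```java
public class CacheReadModel {

    // Without the distributed cache: every task reads the file from HDFS itself.
    static long readsWithoutCache(int[] tasksPerNode) {
        long reads = 0;
        for (int tasks : tasksPerNode) {
            reads += tasks;
        }
        return reads;
    }

    // With the distributed cache: the file is copied once to each node
    // that runs at least one task, and all tasks there share that copy.
    static long readsWithCache(int[] tasksPerNode) {
        long reads = 0;
        for (int tasks : tasksPerNode) {
            if (tasks > 0) {
                reads++;
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        // Hypothetical cluster: three node managers running 100, 50 and 10 tasks.
        int[] tasksPerNode = {100, 50, 10};
        System.out.println("HDFS reads without cache: " + readsWithoutCache(tasksPerNode));
        System.out.println("HDFS reads with cache:    " + readsWithCache(tasksPerNode));
    }
}
```

In a real MapReduce job the file is usually registered in the driver with Job.addCacheFile(URI), and tasks can look up the registered files through context.getCacheFiles(), typically in setup().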
