# HDFS Architecture

[TOC]

## 1. Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is [http://hadoop.apache.org/](http://hadoop.apache.org/).

## 2. Assumptions and Goals

### Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

### Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

### Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

### Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending content to the end of a file is supported, but a file cannot be updated at an arbitrary point. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.
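To make the write-once-read-many model concrete, here is a minimal sketch using the standard `org.apache.hadoop.fs.FileSystem` client API. The cluster URI `hdfs://namenode:8020` and the file path are illustrative assumptions, not values from this document.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally resolved from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/logs/events.log");

        // Create, write, close: after close() the written bytes are immutable.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first record\n");
        }

        // Appends (and truncates) are the only permitted mutations:
        // new data goes at the end, existing bytes are never rewritten in place.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended record\n");
        }
        fs.close();
    }
}
```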
### "Moving Computation is Cheaper than Moving Data"

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

### Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

## 3. NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

![](https://box.kancloud.cn/78957dd1a3c86068730f36c109e2aca7_874x604.png)

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

## 4. The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS supports [user quotas](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html) and [access permissions](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html). HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
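The namespace operations listed above correspond directly to methods on the same `FileSystem` client API; a minimal sketch, again with a hypothetical cluster URI and paths:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        fs.mkdirs(new Path("/data/raw"));                        // create directories
        fs.rename(new Path("/data/raw"), new Path("/data/in"));  // move/rename in the namespace
        fs.delete(new Path("/data/in"), true);                   // remove a directory recursively

        fs.close();
    }
}
```

Each of these calls is a pure metadata operation executed by the NameNode; no user data flows through it.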
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.

## 5. Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. All blocks in a file except the last block are the same size, while users can start a new block without filling out the last block to the configured block size, since support for variable-length blocks was added to append and hsync.

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

![](https://box.kancloud.cn/ea9dcf6bb9f0ef0c4f1ce67a37f7f6f3_874x536.png)

### Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack id each DataNode belongs to via the process outlined in [Hadoop Rack Awareness](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html). A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.

> How does HDFS know which rack a node is on? Rack membership is not detected automatically; an administrator supplies the topology mapping (typically a script or configuration) that the Rack Awareness mechanism consults.
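A client can observe both the replication factor and the rack-aware placement by asking the NameNode for a file's block locations. A minimal sketch, assuming the same hypothetical cluster and the file written earlier:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockPlacementDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/logs/events.log"));

        // Ask the NameNode where each block of the file is stored.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d%n", block.getOffset(), block.getLength());
            // One topology path per replica, e.g. /default-rack/host:port,
            // reflecting the rack id assigned by Rack Awareness.
            for (String replica : block.getTopologyPaths()) {
                System.out.println("  replica at " + replica);
            }
        }
        fs.close();
    }
}
```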
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on the local machine if the writer is on a DataNode, otherwise on a random DataNode; another replica on a node in a different (remote) rack; and the last on a different node in that same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

If the replication factor is greater than 3, the placement of the 4th and following replicas is determined randomly while keeping the number of replicas per rack below the upper limit, which is basically `(replicas - 1) / racks + 2` (a worked sketch follows at the end of this subsection). Because the NameNode does not allow DataNodes to have multiple replicas of the same block, the maximum number of replicas created is the total number of DataNodes at that time.

After the support for [Storage Types and Storage Policies](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html) was added to HDFS, the NameNode takes the policy into account for replica placement in addition to the rack awareness described above. The NameNode chooses nodes based on rack awareness first, then checks that the candidate node has the storage required by the policy associated with the file. If the candidate node does not have the storage type, the NameNode looks for another node. If enough nodes to place replicas cannot be found in the first path, the NameNode looks for nodes having fallback storage types in the second path.

The current, default replica placement policy described here is a work in progress.
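A small worked sketch of the per-rack upper limit quoted above; the replica and rack counts are made up for illustration, and the division is integer division:

```java
public class RackLimitDemo {
    // Upper limit on replicas per rack, as quoted in the text:
    // (replicas - 1) / racks + 2, using integer division.
    static int maxReplicasPerRack(int replicas, int racks) {
        return (replicas - 1) / racks + 2;
    }

    public static void main(String[] args) {
        // 10 replicas over 3 racks: (10 - 1) / 3 + 2 = 5 replicas per rack at most.
        System.out.println(maxReplicasPerRack(10, 3)); // 5
        // The common factor of 3 over 2 racks: (3 - 1) / 2 + 2 = 3.
        System.out.println(maxReplicasPerRack(3, 2));  // 3
    }
}
```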
### Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

### Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.

> In other words: the NameNode waits in Safemode until enough blocks have been reported by DataNodes, and only after leaving Safemode does it start re-replicating any blocks that are still under-replicated.

## 6. The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.

> Note that both metadata structures, the FsImage and the EditLog, are ordinary files in the NameNode's local file system; only the file contents themselves live on the DataNodes.

The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a snapshot of the file system metadata and saving it to FsImage. Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the edits in the EditLog. During the checkpoint the changes from the EditLog are applied to the FsImage. A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period), expressed in seconds, or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
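Both checkpoint triggers are ordinary configuration properties, so they can be set in hdfs-site.xml or programmatically. A minimal sketch with the `Configuration` API; the values shown are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Trigger a checkpoint every hour (value in seconds)...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or once a million namespace transactions accumulate,
        // whichever threshold is reached first.
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);

        System.out.println(conf.get("dfs.namenode.checkpoint.period")); // 3600
        System.out.println(conf.get("dfs.namenode.checkpoint.txns"));   // 1000000
    }
}
```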
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the *Blockreport*.

## 7. The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

## 8. Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.

### Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

The time-out to mark DataNodes dead is conservatively long (over 10 minutes by default) in order to avoid a replication storm caused by state flapping of DataNodes. Users can set a shorter interval to mark DataNodes as stale and avoid stale nodes on reading and/or writing by configuration for performance-sensitive workloads.

### Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

### Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
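The HDFS client does this verification transparently, but the idea is easy to sketch. The following is an illustration of the technique using plain `java.util.zip.CRC32`, not HDFS's actual checksum implementation: compute a checksum at write time, keep it separately, and compare on read.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "block contents as written".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block); // kept separately at write time

        byte[] received = block.clone();
        received[0] ^= 0x01; // simulate a corrupted bit during transfer or storage

        // On read, recompute and compare; a mismatch means this replica is bad
        // and the client should fetch the block from another DataNode.
        if (checksum(received) == stored) {
            System.out.println("block verified");
        } else {
            System.out.println("corrupt replica, try another DataNode");
        }
    }
}
```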
### Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.

Another option to increase resilience against failures is to enable High Availability using multiple NameNodes, either with a [shared storage on NFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html) or using a [distributed edit log](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html) (called Journal). The latter is the recommended approach.

### Snapshots

[Snapshots](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html) support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time.
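Snapshots are reachable through the same `FileSystem` client API. A minimal sketch, assuming an administrator has already allowed snapshots on the directory (e.g. with `hdfs dfsadmin -allowSnapshot`); the directory and snapshot names are hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path dir = new Path("/data/in");

        // Capture the directory's state at this instant; the snapshot is read-only
        // and shows up under /data/in/.snapshot/before-cleanup.
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");
        System.out.println("created " + snapshot);

        // To roll back, files can later be copied out of the .snapshot directory.
        fs.deleteSnapshot(dir, "before-cleanup"); // drop it once it is no longer needed
        fs.close();
    }
}
```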