GFS vs HDFS

Cloud File System


Cloud file storage (CFS) is a storage service that is delivered over the Internet, billed on a pay-per-use basis, and built on common file-level protocols such as Server Message Block (SMB), Common Internet File System (CIFS), and Network File System (NFS).

Differences between GFS and HDFS


Design Goals

GFS:
- The main goal of GFS is to support large files.
- Built on the assumption that terabyte-scale data sets will be distributed across thousands of disks attached to commodity compute nodes.
- Used for data-intensive computing.
- Stores data reliably, even when failures occur within chunk servers, the master, or network partitions.
- Designed more for batch processing than for interactive use by users.

HDFS:
- One of the main goals of HDFS is to support large files.
- Built on the assumption that terabyte-scale data sets will be distributed across thousands of disks attached to commodity compute nodes.
- Used for data-intensive computing.
- Stores data reliably, even when failures occur within name nodes, data nodes, or network partitions.
- Designed more for batch processing than for interactive use by users.

Processes

GFS: Master and chunk servers

HDFS: Name node and data nodes

File Management

GFS:
- Files are organized hierarchically in directories and identified by path names.
- GFS is used exclusively by Google.

HDFS:
- HDFS supports a traditional hierarchical file organization.
- HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service.

Scalability

GFS:
- Cluster-based architecture.
- The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts.
- The largest clusters have over 1000 storage nodes and over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

HDFS:
- Cluster-based architecture.
- Hadoop currently runs on clusters with thousands of nodes.
- For example, Facebook has 2 major clusters:
  - A 1100-machine cluster with 8800 cores and about 12 PB of raw storage.
  - A 300-machine cluster with 2400 cores and about 3 PB of raw storage.
  - Each (commodity) node has 8 cores and 12 TB of storage.
- eBay uses a 532-node cluster (8 * 532 cores, 5.3 PB).
- Yahoo uses more than 100,000 CPUs in over 40,000 computers running Hadoop; its biggest cluster has 4500 nodes (2 * 4-CPU boxes with 4 * 1 TB disks and 16 GB RAM each).
- K. Talattinis et al. concluded in their work that Hadoop is really efficient when running in a fully distributed mode; however, to achieve optimal results and take advantage of Hadoop's scalability, it is necessary to use large clusters of computers.

Protection

GFS:
- Google has its own file system, GFS. With GFS, files are split up and stored in multiple pieces on multiple machines.
- Filenames are random (they do not match content type or owner).
- There are hundreds of thousands of files on a single disk, and all the data is obfuscated so that it is not human readable. The algorithms used for obfuscation change all the time.

HDFS:
- HDFS implements a permission model for files and directories that shares much with the POSIX model.
- Each file or directory has separate permissions for the user that owns it, for other users that are members of its group, and for all other users, as the sketch below illustrates.
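
As a rough illustration of this POSIX-style model, the sketch below sets owner/group/other permissions on a path through Hadoop's FileSystem and FsPermission classes; the path and the chosen mode are invented for the example and are not from the original text.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration.
        Path report = new Path("/user/alice/report.txt");

        // rw- for the owner, r-- for the group, no access for others,
        // mirroring a POSIX-style 640 mode.
        FsPermission mode = new FsPermission(
                FsAction.READ_WRITE, FsAction.READ, FsAction.NONE);
        fs.setPermission(report, mode);
    }
}
```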


Security

GFS:
- Google has dozens of datacenters for redundancy. These datacenters are in undisclosed locations, and most are unmarked for protection.
- Access is allowed to authorized employees and vendors only. Some of the protections in place include: 24/7 guard coverage, electronic key access, access logs, closed-circuit televisions, alarms linked to guard stations, internal and external patrols, dual utility power feeds, and backup power from UPS units and generators.

HDFS:
- HDFS security is based on the POSIX model of users and groups.
- Currently, security is limited to simple file permissions.
- The identity of a client process is just whatever the host operating system says it is (see the sketch below).
- Network authentication protocols like Kerberos for user authentication, and encryption of data transfers, are not yet supported.
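
A minimal sketch of what "whatever the host operating system says" means in practice, assuming Hadoop's UserGroupInformation client class: without Kerberos, the reported identity is simply the OS user running the process.

```java
import java.util.Arrays;
import org.apache.hadoop.security.UserGroupInformation;

public class WhoAmI {
    public static void main(String[] args) throws Exception {
        // Without Kerberos, this identity comes from the operating-system
        // user running the client process, not from a verified credential.
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
        System.out.println("HDFS will treat this client as: " + ugi.getUserName());
        System.out.println("Groups: " + Arrays.toString(ugi.getGroupNames()));
    }
}
```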

Database Files

GFS:
- Bigtable, Google's proprietary distributed database, is built on top of GFS.

HDFS:
- HBase provides Bigtable (Google)-like capabilities on top of Hadoop Core; a brief client sketch follows below.
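
To make the comparison concrete, here is a minimal, assumed HBase client sketch that writes and reads a single cell on top of Hadoop; the table name "webtable", column family "content", and row key are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {

            // Write one cell: row key, column family "content", qualifier "html".
            Put put = new Put(Bytes.toBytes("com.example/index"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index")));
            byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```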

File Serving

GFS:
- A file in GFS is composed of fixed-size chunks of 64 MB. Parts of a file can be stored on different nodes in a cluster, supporting load balancing and storage management.

HDFS:
- A file in HDFS is divided into large blocks for storage and access, typically 64 MB in size. Portions of the file can be stored on different cluster nodes, balancing storage resources and demand; the sketch below shows how a client can observe this layout.
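
As an assumed illustration of this block placement, the sketch below asks Hadoop's FileSystem API where the blocks of a file live; the file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical large file; each ~64 MB block may sit on a different node.
        Path bigFile = new Path("/data/logs/clickstream.log");
        FileStatus status = fs.getFileStatus(bigFile);

        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```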

Cache Management

GFS:
- Clients do cache metadata.
- Neither the server nor the client caches the file data.
- Chunks are stored as local files in a Linux system, so the Linux buffer cache already keeps frequently accessed data in memory. Therefore chunk servers need not cache file data.

HDFS:
- HDFS uses a distributed cache.
- It is a facility provided by the MapReduce framework to cache application-specific, large, read-only files (text, archives, JARs, and so on); see the sketch below.
- Distributed cache files can be private (belonging to one user) or public (belonging to all users of the same node).
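
A minimal, assumed example of the MapReduce distributed-cache facility: the job below ships a read-only lookup file to every node before its tasks run. The job name and file path are invented for illustration.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lookup-join");

        // Hypothetical read-only lookup table, copied once to each node's
        // local disk before the map/reduce tasks start.
        job.addCacheFile(new URI("/shared/lookup/country-codes.txt"));

        // Inside a Mapper or Reducer, the cached files are visible via:
        //   URI[] cached = context.getCacheFiles();
        // and can then be opened from the task's local working directory.
    }
}
```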

Cache Consistency

GFS:
- GFS adopts an append-once-read-many model, which avoids the need to lock files for writing in a distributed environment.
- Clients can append data to an existing file.

HDFS:
- HDFS follows a write-once-read-many model that relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access (see the sketch below).
- Clients can only append to existing files (append was not yet supported at the time of writing).
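
A minimal sketch of the write-once-read-many pattern using Hadoop's FileSystem API, with an invented path: the file is created and written exactly once, then only opened for reading afterwards.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/events/day-01.log");  // hypothetical

        // Write once: the file is created, filled, and closed.
        try (FSDataOutputStream out = fs.create(path, false /* no overwrite */)) {
            out.writeBytes("event-a\nevent-b\n");
        }

        // Read many: any number of readers can now open the immutable file.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n));
        }
    }
}
```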

Communication

GFS:
- Chunk replicas are spread across the racks. The master automatically replicates the chunks.
- A user can specify the number of replicas to be maintained.
- The master re-replicates a chunk as soon as the number of available replicas falls below the user-specified number.

HDFS:
- Automatic replication system.
- Rack-based placement: by default, two copies of each block are stored by different data nodes in the same rack and a third copy is stored on a data node in a different rack (for greater reliability).
- An application can specify the number of replicas of a file that should be maintained by HDFS, as the sketch below shows.
- Replication is pipelined in the case of write operations.
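
As an assumed illustration of per-file replication control, the sketch below sets a target replication factor through Hadoop's FileSystem API; the path and factor are invented, and the cluster-wide default can also be set via the dfs.replication property.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor (hypothetical value).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Ask HDFS to keep 5 replicas of this one important file.
        Path path = new Path("/data/critical/model.bin");  // hypothetical
        boolean accepted = fs.setReplication(path, (short) 5);
        System.out.println("Replication change accepted: " + accepted);
    }
}
```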


Available Implementation

GFS:
- GFS is a proprietary distributed file system developed by Google for its own use.

HDFS:
- HDFS is an open-source implementation; the Hadoop clusters at Yahoo, Facebook, IBM, and other companies are based on HDFS.

