Cloud File System
Cloud file storage (CFS) is a storage service that is delivered over the Internet, is billed on a pay-per-use basis, and has an architecture based on common file-level protocols such as Server Message Block (SMB), Common Internet File System (CIFS), and Network File System (NFS).
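For illustration only: because such services speak standard file-level protocols, an application can use a mounted CFS share through ordinary file APIs. A minimal sketch in Java, where the mount point and file names are hypothetical placeholders, not values from any particular provider:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CloudShareExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical mount point of an NFS or SMB share exported by a cloud file service.
        Path share = Paths.get("/mnt/cloud-share");

        // Ordinary file operations work unchanged against the mounted share.
        Path report = share.resolve("reports/summary.txt");
        Files.createDirectories(report.getParent());
        Files.write(report, "quarterly summary\n".getBytes("UTF-8"));
        System.out.println(new String(Files.readAllBytes(report), "UTF-8"));
    }
}
```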
Difference between GFS and HDFS
The comparison below contrasts GFS and HDFS property by property.
Design Goals

GFS: The main goal of GFS is to support large files. It is built on the assumption that terabyte-scale data sets will be distributed across thousands of disks attached to commodity compute nodes, and it is used for data-intensive computing. It stores data reliably even when failures occur in chunkservers, the master, or network partitions. GFS is designed more for batch processing than for interactive use.

HDFS: One of the main goals of HDFS is likewise to support large files. It is built on the same assumption that terabyte-scale data sets will be distributed across thousands of disks attached to commodity compute nodes, and it is used for data-intensive computing. It stores data reliably even when failures occur in NameNodes, DataNodes, or network partitions. HDFS is also designed more for batch processing than for interactive use.
Processes

GFS: Master and chunkservers.

HDFS: NameNode and DataNodes.
File Management

GFS: Files are organized hierarchically in directories and identified by path names. GFS is used exclusively within Google.

HDFS: HDFS supports a traditional hierarchical file organization. Hadoop also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3).
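To make the hierarchical namespace concrete, here is a minimal sketch of creating and listing directories through the Hadoop FileSystem API; the NameNode URI and paths are placeholders, not values from the text.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; a real deployment picks this up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Create a directory somewhere in the hierarchical namespace (path is hypothetical).
        Path dir = new Path("/user/example/reports");
        fs.mkdirs(dir);

        // List the parent directory, exactly as one would in a traditional file system.
        for (FileStatus status : fs.listStatus(dir.getParent())) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```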
Scalability

GFS: Cluster-based architecture. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts. The largest clusters have over 1,000 storage nodes and over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

HDFS: Cluster-based architecture. Hadoop currently runs on clusters with thousands of nodes. For example, Facebook has two major clusters:
- An 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage.
- A 300-machine cluster with 2,400 cores and about 3 PB of raw storage.
- Each (commodity) node has 8 cores and 12 TB of storage.
eBay uses a 532-node cluster (8 x 532 cores, 5.3 PB). Yahoo uses more than 100,000 CPUs in over 40,000 computers running Hadoop; its biggest cluster has 4,500 nodes (2 x 4-CPU boxes with 4 x 1 TB disks and 16 GB RAM). K. Talattinis et al. concluded in their work that Hadoop is efficient when running in fully distributed mode, but that to achieve optimal results and take advantage of Hadoop's scalability it is necessary to use large clusters of computers.
Protection

GFS: Google has its own file system, GFS. With GFS, files are split up and stored in multiple pieces on multiple machines. Filenames are random (they do not match content type or owner). There are hundreds of thousands of files on a single disk, and all the data is obfuscated so that it is not human readable. The algorithms used for obfuscation change all the time.

HDFS: HDFS implements a permission model for files and directories that shares much of the POSIX model. Each file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users.
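As an illustration of that POSIX-style model, here is a minimal sketch of setting permissions and ownership through the Hadoop FileSystem API; the path, user, and group names are made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissions {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/data.txt");  // hypothetical path

        // Owner: read/write, group: read, others: none (the POSIX-style 640 pattern).
        FsPermission perm = new FsPermission(FsAction.READ_WRITE, FsAction.READ, FsAction.NONE);
        fs.setPermission(file, perm);

        // Ownership follows the same owner/group model (user and group names are made up).
        fs.setOwner(file, "alice", "analysts");
        fs.close();
    }
}
```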
Security

GFS: Google has dozens of datacenters for redundancy. These datacenters are in undisclosed locations, and most are unmarked for protection. Access is allowed only to authorized employees and vendors. Protections in place include 24/7 guard coverage, electronic key access, access logs, closed-circuit television, alarms linked to guard stations, internal and external patrols, dual utility power feeds, and backup power from UPS units and generators.

HDFS: HDFS security is based on the POSIX model of users and groups. Currently, security is limited to simple file permissions. The identity of a client process is simply whatever the host operating system says it is. Network authentication protocols such as Kerberos for user authentication, and encryption of data transfers, are not yet supported.
Database Files
|
Bigtable
is the database used by GFS. Bigtable is a proprietary distributed database
of Google Inc.
|
HBase
provides Bigtable (Google) -like capabilities on top of Hadoop Core.
|
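For a sense of what those Bigtable-like capabilities look like in practice, here is a minimal sketch using the HBase client API to write and read one cell; the table name, row key, and column names are invented for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webpages"))) {  // hypothetical table
            // Store one cell: row key -> column family "content", qualifier "html".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
                          Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```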
File Serving
|
A file in GFS is
comprised of fixed sized chunks. The size of chunk is 64MB. Parts of a file
can be stored on different nodes in a cluster satisfying the concepts load
balancing and storage management.
|
HDFS is divided into
large blocks for storage and access, typically 64MB in size. Portions of the
file can be stored on different cluster nodes, balancing storage resources
and demand
|
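The block layout of an HDFS file can be inspected from a client. A minimal sketch, with a placeholder file path, that prints the block size and the DataNodes holding each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/big.log");  // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        // Each BlockLocation names the DataNodes that hold one block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Offset " + block.getOffset() + " -> "
                               + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```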
Cache Management
|
Clients do cache metadata.
Neither
the sever nor the client caches
the file data.
Chunks
are stored as local files in a Linux system. So, Linux buffer cache already
keeps frequently accessed data in memory. Therefore chunk servers need not
cache file data.
|
HDFS uses distributed
cache
It
is a facility provided by Mapreduce framework to cache application-specific,
large, read-only files (text, archives, jars and so on)
Private
(belonging to one user) and Public (belonging to all the user of the same
node) Distributed Cache Files
|
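A minimal sketch of using that facility from a MapReduce driver: a read-only side file is registered on the job so every task receives a local copy. The job name and file URI are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lookup-join");  // hypothetical job name

        // Ship a read-only side file to every task; each task reads it as a local file
        // via the "lookup" symlink created in its working directory.
        job.addCacheFile(new URI("hdfs:///user/example/lookup.txt#lookup"));  // hypothetical path

        // Mapper, reducer, input and output paths would be configured here, then:
        // job.waitForCompletion(true);
    }
}
```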
Cache Consistency

GFS: Google adopts an append-once-read-many model, which avoids the need to lock files for writing in a distributed environment. A client can append data to an existing file.

HDFS: HDFS's write-once-read-many model relaxes concurrency-control requirements, simplifies data coherency, and enables high-throughput access. A client can only append to existing files (not yet supported at the time of writing).
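The Hadoop FileSystem API does expose an append call, although, as noted above, whether it actually works depends on the HDFS release and its configuration. A minimal sketch of the write-once-then-append pattern, with a placeholder path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendToFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/user/example/events.log");  // hypothetical file

        // Write once: create the file if it does not exist yet.
        if (!fs.exists(log)) {
            fs.create(log).close();
        }

        // Append a record; this only succeeds on releases/configurations with append enabled.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("new event record\n");
        }
        fs.close();
    }
}
```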
Communication

GFS: Chunk replicas are spread across racks. The master automatically replicates chunks. A user can specify the number of replicas to be maintained, and the master re-replicates a chunk as soon as the number of available replicas falls below that user-specified number.

HDFS: Automatic, rack-aware replication. By default, two copies of each block are stored by different DataNodes in the same rack and a third copy is stored on a DataNode in a different rack (for greater reliability). An application can specify the number of replicas of a file that HDFS should maintain. Write operations use replication pipelining.
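A minimal sketch of how an application specifies replication in HDFS, both as a cluster-wide default and per file; the property value, path, and replication factors are illustrative only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor (normally set in hdfs-site.xml).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/important.dat");  // hypothetical file

        // Per-file replication factor requested by the application.
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}
```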
Available Implementation

GFS: GFS is a proprietary distributed file system developed by Google for its own use.

HDFS: HDFS is open source; the Hadoop deployments at Yahoo, Facebook, IBM, and others are based on HDFS.