Sunday, May 14, 2006

Lessons learned from the TeraGrid

Cross-posted from The Grid Blog

IBM developerWorks is running a series called "Lessons learned from the TeraGrid". So far, two articles in the series have been published.
The grid computing community is already well aware of the TeraGrid: it is an NSF-funded project that provides integrated computational and data infrastructure for research scientists in the United States. The TeraGrid currently delivers more than 50 teraFLOPS of compute power and 600 TB of storage.
TeraGrid compute resources include IBM IA64 clusters, IBM and Dell IA32 clusters (all running Linux®), 32-processor and 8-processor IBM POWER4/AIX® clusters, an IBM Blue Gene®/L rack, a Cray XT3, and SPARC/Solaris nodes. Data resources include at least six IBM General Parallel File System (GPFS) systems and an assortment of other parallel file systems, including file systems based on Parallel Virtual File System 2 (PVFS2), Lustre, Sun Microsystems' QFS, and IBRIX, along with several Mass Storage Systems (MSS) and Hierarchical Storage Management (HSM) systems.

The first article looks at how grid computing enables large-scale resource sharing throughout the TeraGrid transparently and easily.
It introduces the TeraGrid, currently the largest set of public high-end computational resources in the United States, describes the motivations behind the project, and briefly presents some of the challenges inherent in managing a large, geographically distributed grid. A wide variety of strategies and tools are required to address these challenges, and the article provides an overview of some of the most significant ways in which the TeraGrid project has overcome them.

The second article focuses on the two main data management approaches used in the TeraGrid: GridFTP and the GPFS-WAN file system. GridFTP, as we know, is a set of extensions to the FTP protocol used to transfer files between grid nodes. GPFS-WAN, based on IBM's General Parallel File System, provides a parallel file system shared across the grid nodes. The security and other flexibility features of the system are of particular interest. According to the article, the TeraGrid team allocated 64 server nodes to the file system and 6 server nodes to metadata operations.
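As an illustration of the GridFTP approach (this example is mine, not from the article), transfers are typically driven with the Globus Toolkit's globus-url-copy client. The hostnames and paths below are hypothetical; the flags are standard globus-url-copy options:

```shell
# Third-party GridFTP transfer between two (hypothetical) TeraGrid sites.
# -p 8       : open 8 parallel TCP streams for higher throughput
# -tcp-bs 4M : set the TCP buffer size per stream
globus-url-copy -p 8 -tcp-bs 4M \
    gsiftp://tg-login.site-a.example.org/scratch/input.dat \
    gsiftp://tg-data.site-b.example.org/gpfs-wan/input.dat
```

Parallel streams and tuned TCP buffers are what let GridFTP fill high-bandwidth, high-latency wide-area links where plain FTP would stall.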

This was the first time I had read about this, so let me quote a few sentences from the article:
two primary challenges must be overcome when providing a parallel file system on a grid: authentication and user-identity mapping; and providing a high level of performance under heavy load when potentially thousands of machines could be accessing the same file system simultaneously.

For GPFS-WAN, the TeraGrid team began by allocating a large number of disks to the file system (twice the amount available to any other high-performance file system), ensuring that the underlying storage would be more than sufficient to fill the available bandwidth.
