NextIO’s vCORE: Consolidates and optimizes GPU operations
A boon for scientific and technical computing
By Deni Connor
Principal Analyst, Storage Strategies NOW
May 2010
Introduction
Originally used for 3D gaming acceleration, GPUs have come to the forefront of scientific computing, Monte Carlo simulations in financial services, oil and gas exploration, and pattern recognition, among other applications. These are all embarrassingly parallel workloads that involve very large datasets and heavy floating-point computation; because they can be broken down into parallel processes, running them on GPUs saves organizations time and money.
For all the advantages of using GPUs for technical and scientific computing, they also bring a wealth of challenges, mostly centered on GPU maintenance and replacement. Workloads are typically split so that the sequential part of the application runs on the CPU and the computationally intensive part runs on the GPU. If the GPU requires replacement, all workloads running on the server that contains it must be stopped, the server taken down and the GPU replaced. The same thing happens when new firmware needs to be loaded or when a new version of the GPU needs to be added to the system: operations stop while labor-intensive, high-touch GPU maintenance occurs. This affects every application running on the server, not just the one running on the failed GPU, and delays the organization’s ability to do its work.
As the number of jobs and GPUs increases, so do the dependencies and the potential for resource contention. Jobs that don’t require GPUs for computation may be assigned to servers containing GPUs, while workloads requiring GPUs wait for them to become available.
The result of GPU contention and maintenance factors is that system administrators tend to over-provision GPUs to ensure that one is always available when needed for a planned workload. This approach is clearly problematic – TCO increases, labor intensifies and ROI decreases.
Imagine if you could look at GPUs as a shared pool of resources – a concentrator or consolidator of sorts – that would allow you to assign GPUs to workloads and servers on demand and manage and configure GPUs centrally. Is that a pipe dream or reality?
Enter vCORE
NextIO recently introduced the vCORE consolidation appliance, a shared-resource appliance that gathers GPUs in a single external enclosure. The vCORE appliance consists of a 4U (7-inch) high, 20-inch deep enclosure containing either eight double-wide or 16 single-wide GPUs, which can be connected to servers in an industry standards-based fashion. The appliance currently accommodates NVIDIA Tesla or Quadro GPUs, and in the future will work with GPUs from other vendors. It attaches to any x86-based blade or rack-mounted server via the PCIe bus. Having the GPUs concentrated in a single enclosure also provides investment protection and future-proofing: GPUs can be dynamically added to the system as needed or replaced when failures occur without affecting workloads running on other GPUs.
Figure 1. The vCORE consolidation appliance
The vCORE consolidation appliance has 3+1 fan cooling and 2400W of power and cooling capacity. Carrier cards are incorporated into the appliance for hot-swappable GPU replacement. Further, the appliance can be managed remotely via the included nControl management software, which offers a GUI, a command line and a third-party API. The vCORE appliance also supports the I/O Virtualization specification for Fibre Channel and Gbit Ethernet, which allows multiple operating systems running simultaneously within a single computer to natively share PCIe devices.
Finally, the vCORE appliance is available in several models, which vary by the number of GPUs, servers and network connections they support.
Figure 2. vCORE models
Impressive in operation, the appliance delivers 78 to 630 GFLOPS of performance per GPU and supports GPUs with 240 to 512 cores each.
vCORE serviceability, configurability and management
The vCORE appliance also brings easy serviceability and configurability. When a GPU needs to be replaced or upgraded, applications assigned to other GPUs can continue to run. The affected GPU is taken out of service by removing it from its hot-pluggable sled, the new GPU is inserted without impact on other running applications, and once the new GPU is brought online the application is restarted.
Management of GPUs is also centralized with the vCORE appliance. GPUs can be shut down from the console, removed physically from the enclosure and when newly repositioned in the chassis, they’ll appear on the management console once again, where they can be re-enabled and workloads restarted. The versatility of the nControl management software is also apparent – administrators can interface with the vCORE appliance via a Web-based GUI, a command line interface or open APIs to third-party workload schedulers such as the Moab Cluster Suite or TORQUE Resource Manager.
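The service workflow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `GpuPool` and `GpuSlot` classes, their method names and the job names are all hypothetical and are not part of nControl or any NextIO API. It shows one GPU being taken offline and replaced while a job on another GPU keeps running.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GpuSlot:
    slot: int
    online: bool = True
    job: Optional[str] = None  # name of the workload currently assigned


@dataclass
class GpuPool:
    """Toy model of a shared GPU enclosure (hypothetical, not the nControl API)."""
    slots: Dict[int, GpuSlot] = field(default_factory=dict)

    @classmethod
    def with_slots(cls, count: int) -> "GpuPool":
        return cls({i: GpuSlot(i) for i in range(count)})

    def assign(self, job: str) -> int:
        """Give the job the first GPU that is online and idle."""
        for s in self.slots.values():
            if s.online and s.job is None:
                s.job = job
                return s.slot
        raise RuntimeError("no free GPU in the pool")

    def take_offline(self, slot: int) -> Optional[str]:
        """Disable one GPU for service; return the job that must be restarted."""
        s = self.slots[slot]
        interrupted, s.job, s.online = s.job, None, False
        return interrupted

    def bring_online(self, slot: int) -> None:
        """Mark a replaced GPU as available again."""
        self.slots[slot].online = True

    def running(self) -> Dict[str, int]:
        """Map each running job to the slot it occupies."""
        return {s.job: s.slot for s in self.slots.values() if s.job}


pool = GpuPool.with_slots(8)             # eight double-wide GPUs
a = pool.assign("monte-carlo")
b = pool.assign("seismic")
interrupted = pool.take_offline(a)       # GPU in slot `a` is pulled for replacement
still_running = pool.running()           # the job on slot `b` is unaffected
pool.bring_online(a)                     # replacement GPU appears on the console
pool.assign(interrupted)                 # only the interrupted job is restarted
```

In this model only the job on the serviced GPU is interrupted, which is the contrast the brief draws with server-resident GPUs, where every workload on the host stops.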
Total cost of ownership analysis
In this example, the customer requirement is 309 double-precision Tflops, and the analysis assumes $0.11 per kWh and a 100% duty cycle. To meet the requirement with x86 servers alone, a total of 4,217 high-end servers are required, along with over 100 48-port switches. This many servers use nearly three million dollars per year in energy alone, not counting cooling expense.
| Configuration | Total servers | Cisco switches | vCORE | Capital expense | Tflops | Energy cost per year | Total cost (3 years) | Savings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x86 only | 4,217 | 101 | 0 | $13.358M | 309.7 | $2.873M | $21.979M | 0 |
| x86 + vCORE | 152 | 15 | 76 | $3.600M | 324.4 | $0.248M | $4.345M | $17.633M |
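The arithmetic behind the table can be checked with a short sketch. It assumes the three-year total is simply capital expense plus three years of energy cost, which the source does not state explicitly; the function and variable names are illustrative. The second helper backs out the average electrical load implied by the yearly energy bill at $0.11 per kWh and a 100% duty cycle.

```python
# Figures taken from the table above, in millions of US dollars.
CONFIGS = {
    "x86 only": {"capex": 13.358, "energy_per_year": 2.873},
    "x86 + vCORE": {"capex": 3.600, "energy_per_year": 0.248},
}
YEARS = 3
HOURS_PER_YEAR = 8760  # 100% duty cycle, as the analysis assumes


def total_cost(capex, energy_per_year, years=YEARS):
    """Capital expense plus energy cost over the period (assumed cost model)."""
    return capex + years * energy_per_year


def implied_power_kw(energy_cost_musd, price_per_kwh=0.11):
    """Average load in kW implied by a yearly energy bill in $M."""
    return energy_cost_musd * 1e6 / (price_per_kwh * HOURS_PER_YEAR)


totals = {name: total_cost(**cfg) for name, cfg in CONFIGS.items()}
savings = totals["x86 only"] - totals["x86 + vCORE"]

for name, total in totals.items():
    print(f"{name}: ${total:.3f}M over {YEARS} years")
print(f"savings: ${savings:.3f}M")
print(f"implied x86-only load: {implied_power_kw(2.873):.0f} kW")
```

Under this assumed model the totals come out within about $0.002M of the published figures, a gap consistent with rounding in the table. The implied load for the x86-only configuration works out to roughly 0.7 kW per server across 4,217 servers.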
SSG-NOW Assessment
The NextIO vCORE brings large-scale GPU computing into the enterprise. With its centralized management and consolidation of 8 or 16 GPUs, the vCORE appliance is a forerunner in making technical and scientific computing easier. Its serviceability and configurability are the answer to GPU maintenance and replacement, and its rack-mounted approach makes it easy to install in existing data center rack configurations. Because the vCORE appliance accommodates GPUs and servers from any vendor, it allows users to avoid vendor lock-in, reduce their TCO and optimize performance as the GPU market advances.
Note: The information and recommendations made by Storage Strategies NOW are based upon public information and sources and may also include personal opinions both of Storage Strategies NOW and others, all of which we believe to be accurate and reliable. As market conditions change however, and not within our control, the information and recommendations are made without warranty of any kind. All product names used and mentioned herein are the trademarks of their respective owners. Storage Strategies NOW, Inc. assumes no responsibility or liability for any damages whatsoever (including incidental, consequential or otherwise), caused by your use of, or reliance upon, the information and recommendations presented herein, nor for any inadvertent errors which may appear in this document.