Total Technical Services

Maintained three heterogeneous clusters in a research environment, using xCAT 2.x for node deployment and Platform LSF for scheduling. Developed multiple images for each cluster based on the needs of the research team. Maintained cluster hardware and software infrastructure, including storage, InfiniBand, and 10 Gb and 1 Gb Ethernet. Developed KPI reporting for upper management. Implemented Platform RTM for cluster performance monitoring. Helped design and implement a 40-node GPU cluster and integrated it into the existing xCAT and LSF systems. Designed and deployed a cluster health monitoring system using Icinga. Participated in maintaining DDN Lustre, NetApp, and Isilon storage systems, and in upgrading storage capacity and infrastructure for both DDN and Isilon.
