Note: Publications prior to Sept 2011 refer to a different and now deprecated architecture for data collection and transport (i.e., they do NOT use LDMS).
2023
Prodigy: Toward Unsupervised Anomaly Detection in Production HPC Systems
Burak Aksar, Efe Sencan, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Brian Kulis, Manuel Egele, and Ayse K. Coskun
SC ‘23: Proc. of the Int’l Conf. for High Performance Computing, Networking, Storage and Analysis.
Nov 2023, to appear.
Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations
Francieli Boito, Jim Brandt, Valeria Cardellini, Philip Carns, Florina M. Ciorba, Hilary Egan, Ahmed Eleliemy, Ann Gentile, Thomas Gruber, Jeff Hanson, Utz-Uwe Haus, Kevin Huck, Thomas Ilsche, Thomas Jakobsche, Terry Jones, Sven Karlsson, Abdullah Mueen, Michael Ott, Tapasya Patki, Krishnan Raghavan, Stephen Simms, Kathleen Shoga, Michael Showerman, Devesh Tiwari, Torsten Wilde, Ivy Peng, and Keiji Yamamoto
Proc. IEEE Cluster
Oct 2023, to appear
Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers
B. Aksar, E. Sencan, B. Schwaller, V. Leung, J. Brandt, B. Kulis, M. Egele, and A. Coskun
AI4Sys ‘23: Proceedings of the First Workshop on AI for Systems, August 2023, pp 1–6
Evaluating and Influencing Extreme-Scale Monitoring Implementations
Jim Brandt, Chris Morrone, Eric Roman, Ann Gentile, Tom Tucker, Jeff Hanson, Kathleen Shoga, and Alec Scott
Proc Cray User’s Group
May 2023.
Driving HPC Operations With Holistic Monitoring and Operational Data Analytics (Dagstuhl Seminar 23171)
Jim Brandt, Florina Ciorba, Ann Gentile, Michael Ott, and Torsten Wilde
In Dagstuhl Reports, Volume 13, Issue 4, pp. 98-120, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
Community Readiness and Opportunities for Progress in HPC Monitoring, Analysis, Feedback, and Response – Keynote
J. Brandt
Apr 2023
2021
Systematically Inferring I/O Performance Variability by Examining Repetitive Job Behavior
E. Costa, T. Patel, B. Schwaller, J. Brandt, D. Tiwari
SC ‘21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2021, Article No.: 33, Pages 1–15
Backfilling HPC Jobs with a Multimodal-Aware Predictor
K. Lamar, A. Goponenko, C. Peterson, B. Allan, J. Brandt, and D. Dechev
2021 IEEE International Conference on Cluster Computing (CLUSTER), Portland, OR, USA, 2021, pp. 618-622
Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation
Y. Zhang, B. Aksar, O. Aaziz, B. Schwaller, J. Brandt, V. Leung, M. Egele, and A. Coskun
2021 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 2021, pp. 1-7
Integrating Systems Operations into CoDesign – Keynote
Presented by A. Gentile
A. Patke, S. Jha, H. Qiu, J. Brandt, A. Gentile, J. Greenseid, Z. Kalbarczyk, and R. Iyer
ICS ‘21: Proceedings of the ACM International Conference on Supercomputing, June 2021, Pages 342–353
Presented by A. Gentile
2021 ECP Annual Meeting Center and Application Monitoring WG. Apr 2021.
2020
HPC System Data Pipeline to Enable Meaningful Insights through Analytic-Driven Visualizations
B. Schwaller, N. Tucker, T. Tucker, B. Allan, and J. Brandt
2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 433-441
Towards Workload-Adaptive Scheduling for HPC Clusters
A. Goponenko, R. Izadpanah, J. Brandt, and D. Dechev
2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 449-453
LDMS Monitoring of EDR InfiniBand Networks – workshop work-in-progress paper & presentation
B. Allan, M. Aguilar, B. Schwaller, S. Langer
2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 459-463
Also as Sandia Technical Report SAND2020-8534C (paper) and SAND2020-9599C (presentation).
Inspecting fast commodity RDMA network performance on production systems with LDMS – Workshop presentation
B. Allan, M. Aguilar, B. Schwaller, S. Langer
Sandia Technical report SAND2020-8014C.
Production LDMS, genders, systemd, and the future – Workshop presentation
B. Allan
Sandia Technical report SAND2020-8015C.
LDMS packaging: Moving from tribal knowledge to community knowledge – Workshop presentation
B. Allan
Sandia Technical report SAND2020-8013C.
ALAMO: Autonomous Lightweight Allocation, Management, and Optimization
R. Brightwell, K. B. Ferreira, R. E. Grant, S. Levy, J. Lofstead, S. L. Olivier, K. T. Pedretti, A. J. Younge, A. Gentile, and J. Brandt.
In: Nichols J., Verastegui B., Maccabe A., Hernandez O., Parete-Koon S., Ahearn T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI.
Smoky Mountains Computational Sciences and Engineering Conference (SMC2020) Communications in Computer and Information Science, vol 1315. Springer, Cham., 2020.
S. Jha, A. Gentile, J. Brandt, A. Patke, B. Lim, G. Bauer, M. Showerman, L. Kaplan, Z. Kalbarczyk, W. Kramer, and R. Iyer
Attributing Performance Variation from Integrated Application and System Data – poster
O. Aaziz, B. Allan, J. Brandt, J. Cook, K. Devine, J. Elliott, A. Gentile, S. Olivier, K. Pedretti, and T. Tucker
Applied Computer Science Meeting, Feb 2020.
2019
Enabling Machine Learning-based HPC Performance Diagnostics in Production Environments – Panel Organizer
Organizers: M. Showerman, J. Greenseid, A. Gentile, and J. Brandt
Panelists: W. T. Kramer (NCSA), R. Gerber (NERSC), N. Brown (EPCC), and A. Saxton (NCSA)
SC19, Fri Nov 22, 2019, 8:30 AM
Holistic Measurement Driven System Assessment (HMDSA) – poster
S. Jha, M. Showerman, A. Saxton, J. Enos, G. Bauer, Z. Kalbarczyk, A. Gentile, J. Brandt, R. Iyer, and W. T. Kramer
A Machine Learning Approach to Understanding HPC Application Performance Variation – poster
B. Aksar, B. Schwaller, O. Aaziz, E. Ates, J. Brandt, A. K. Coskun, M. Egele, and V. Leung
LDMS v4: Writing Sampler and Store Plugins
A. Gentile
Sandia National Laboratories, SAND2019-12858 O, Oct 2019.
Figures of merit for production HPC
B. Allan
Sandia National Laboratories, SAND2019-12564, Oct. 2019.
Proxy or Imposter? A Method and Case Study to Determine the Answer
O. Aaziz, J. Cook, C. Vaughan, and D. Richards
2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, NM, USA, 2019, pp. 1-9
Standardized Environment for Monitoring Heterogeneous Architectures
C. Brown, B. Schwaller, N. Gauntt, B. Allan and K. Davis
2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, NM, USA, 2019, pp. 1-5
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
S. Jha, A. Patke, J. Brandt, A. Gentile, M. Showerman, E. Roman, Z. Kalbarczyk, and R. Iyer
2019 IEEE Symposium on High-Performance Interconnects (HOTI), Santa Clara, CA, USA, 2019, pp. 45-48
B. Allan
Sandia National Laboratories, SAND2019-10266C, Aug. 2019.
HPAS: An HPC Performance Anomaly Suite for Reproducing Performance Variations
E. Ates, Y. Zhang, B. Aksar, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun
ICPP ‘19: Proceedings of the 48th International Conference on Parallel Processing, August 2019, Article No.: 40, Pages 1–10
Production Application Performance Data Streaming for System Monitoring
R. Izadpanah, B. Allan, D. Dechev, and J. Brandt
ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS). Vol 4 Issue 2, Article No.: 8, pp 1–25, 2019
Exploring New Monitoring and Analysis Capabilities on Cray’s Software Preview System
J. Brandt, C. Brown, S. Donoho, A. Gentile, J. Greenseid, W. Kramer, P. Langer, A. Rashid, K. Rehm, and M. Showerman
Extracting Actionable System-Application Performance Factors
J. Brandt, A. Gentile, and J. Cook
Minisymposium on Modeling Resource Utilization and Contention in HPC System-Application Interactions – Minisymposium Organizer
Holistic Measurement Driven System Assessment (HMDSA) – poster
Bill Kramer, Greg Bauer, Brett Bode, Mike Showerman, Jeremy Enos, Aaron Saxton, Saurabh Jha, Zbigniew Kalbarczyk, and Ravishankar Iyer (NCSA/UIUC) and James Brandt and Ann Gentile (SNL)
Two Weeks In The Life of Skybridge – SLURM and LDMS metrics and metadata.
B. Allan
Sandia National Laboratories SAND 2019-4915, April 2019.
2018
Platform Independent Run Time HPC Monitoring, Analysis, and Feedback at Any-Scale – Featured Presentation at DOE Booth
J. Brandt
SC18, Nov 2018.
Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights – BoF Session Organizer
An Efficient Latch-free Database Index Based on Multi-dimensional Lists
K. Lamar, R. Izadpanah, J. Brandt, and D. Dechev
2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC), Orlando, FL, USA, 2018, pp. 1-2
Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. Leung, M. Egele, and A. Coskun
IEEE Transactions on Parallel and Distributed Systems
A Methodology for Characterizing the Correspondence Between Real and Proxy Applications
O. Aaziz, J.M. Cook, J. Cook, T. Juedeman, D. Richards, and C. Vaughan
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 190-200
Large-Scale System Monitoring Experiences and Recommendations – Invited Peer-Reviewed Submission at HPCMASPA
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, M. Gienger, J. Greenseid, A. Greiner, B. Hadri, Y. (Helen) He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 532-542
Characterizing Supercomputer Traffic Networks Through Link-Level Analysis
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, and R. Iyer
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 562-570
Modeling Expected Application Runtime for Characterizing and Assessing Job Performance
O. Aaziz, J. Cook, and M. Tanash
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 543-551
Taxonomist: Application Detection through Rich Monitoring Data – Best Artifact Award
E. Ates, O. Tuncer, A. Turk, V. J. Leung, J. Brandt, M. Egele and A. K. Coskun
Euro-Par 2018: Parallel Processing: 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27 - 31, 2018, Pages 92–105
Integrating Low-latency Analysis into HPC System Monitoring
R. Izadpanah, N. Naksinehaboon, J. Brandt, A. Gentile, and D. Dechev
ICPP ‘18: Proceedings of the 47th International Conference on Parallel Processing, August 2018, Article No.: 5, Pages 1–10
Cray System Monitoring: Successes, Requirements, Priorities
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, J. Greenseid, A. Greiner, B. Hadri, Y. He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams. (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
Supporting Failure Analysis with Discoverable, Annotated Log Datasets
S. Leak, A. Greiner, A. Gentile, and J. Brandt
Automated Analysis and Effective Feedback – BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile
Runtime HPC System and Application Performance Assessment and Diagnostics
J. Brandt, A. Gentile, Jon Cook, B. Allan, Jeanine Cook, O. Aaziz, T. Tucker, N. Naksinehaboon, N. Taerat, E. Ates, O. Tuncer, M. Egele, A. Turk, and A. Coskun
Continuous Performance Tracking for Kokkos using LDMS
J. Brandt, S. Hammond, T. Tucker, A. Gentile, and J. Cook
Programming Models and CoDesign Meeting, Albuquerque, NM. Feb 2018.
2017
Systems Monitoring Data in Action – BoF Session Organizer
SC17, 12:15pm-1:15 pm Thurs Nov 16 2017.
Holistic Measurement Driven System Assessment
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, G. Bauer, J. Enos, M. Showerman, L. Kaplan, B. Bode, A. Greiner, A. Bonnie, M. Mason, R. Iyer, and W. Kramer
2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, 2017, pp. 797-800
Diagnosing Performance Variations in HPC Applications Using Machine Learning – Gauss Award Winner
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun
High Performance Computing: 32nd International Conference, ISC High Performance 2017, Frankfurt, Germany, June 18–22, 2017, Pages 355–373
LDMS Version 3 Tutorial and Demo Material – (NB: Deprecated)
J. Brandt, T. Tucker, A. Gentile, N. Naksinehaboon, and N. Taerat
Sandia National Laboratories, SAND2017-5153 O, May 2017.
Understanding Fault Scenarios and Impacts Through Fault Injection Experiments in Cielo
V. Formicola, S. Jha, F. Deng, D. Chen (UIUC), A. Bonnie, M. Mason (LANL), J. Brandt, A. Gentile (SNL), L. Kaplan, J. Repik (Cray), J. Enos, M. Showerman (NCSA), A. Greiner (NERSC), Z. Kalbarczyk, R. Iyer, and W. Kramer (UIUC)
Runtime Collection and Analysis of System Metrics for Production Monitoring of Trinity Phase II
A. DeConinck, H. Nam, D. Morton, A. Bonnie, C. Lueninghoener (LANL), J. Brandt, A. Gentile, K. Pedretti, A. Agelastos, C. Vaughan, S. Hammond, B. Allan (SNL), M. Davis and J. Repik (Cray)
Holistic Systems Monitoring and Analysis – BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile
Contention and Congestion: Challenges and Approaches to Understanding Application Impact
A. Gentile, J. Brandt, A. Agelastos, J. Lamb, K. Ruggirello, and J. Stevenson
2016
SC16, Fri 18th Nov 2016 10:30-noon.
Monitoring Large Scale HPC Systems: Understanding, Diagnosis and Attribution of Performance Variation and Issues – BoF Session Organizer
SC16, 5:15pm-7pm Wed Nov 16 2016.
Discovery, Interpretation, and Communication of Meaningful Information in HPC Monitoring Data
Holistic Measurement Driven Resilience
Continuous Whole-System Monitoring Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
Large-Scale Persistent Numerical Data Source Monitoring System Experiences
J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer
2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Chicago, IL, USA, 2016, pp. 1711-1720
Design and Implementation of a Scalable HPC Monitoring System
S. Sanchez, A. Bonnie, G. Van Heule, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt
2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Chicago, IL, USA, 2016, pp. 1721-1725
Network Performance Counter Monitoring and Analysis on the Cray XC Platform
J. Brandt, E. Froese, A. Gentile, L. Kaplan, B. Allan, and E. Walsh
Dynamic Model Specific Register (MSR) Data Collection as a System Service
G. H. Bauer, J. Brandt, A. Gentile, A. Kot, and M. Showerman
A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, and M. Mason (LANL), J. Brandt, A. Gentile, B. Allan, and A. Agelastos (SNL), M. Davis and M. Berry (Cray)
M. Showerman, J. Brandt, and A. Gentile
Smart HPC Centers: Data, Analysis, Feedback, and Response
J. Brandt, A. Gentile, C. Martin, B. Allan, and K. Devine
Monitoring High Speed Network Fabrics: Experiences and Needs
J. Brandt, A. Gentile, B. Allan, S. Lefantzi, and M. Aguilar
Monitoring Large Scale HPC Platforms: Issues, Approaches, and Experiences
2015
Monitoring Large-Scale HPC Systems: Data Analytics and Insights – BOF Session Organizer
Infrastructure for In Situ System Monitoring and Application Data Analysis
J. Brandt, K. Devine, and A. Gentile
ISAV 2015: Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, November 2015, Pages 36–40
New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat
Extending LDMS to Enable Performance Monitoring in Multi-Core Applications
S. Feldman, D. Zhang, D. Dechev, and J. Brandt
Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity – Best Paper Finalist
J. Brandt, D. DeBonis, A. Gentile, J. Lujan, C. Martin, D. Martinez, S. Olivier, K. Pedretti, N. Taerat, and R. Velarde
Scalable Integrated High-Fidelity Continuous Monitoring
at System Monitoring of Cray Systems BoF
Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping – Minisymposium Presentation
J. Brandt, K. Devine, A. Gentile, and K. Pedretti
Minisymposium on Topology Mapping and Locality
2011
OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|11 Seattle, WA, November 2011.
Develop Feedback System for Intelligent Dynamic Resource Allocation to Improve Application Performance
J. Brandt, A. Gentile, D. Thompson and T. Tucker
SAND2011-6301. Sandia National Laboratories. September 2011.
Framework for Enabling System Understanding
J. Brandt, F. Chen, A. Gentile, C. Leangsuksun, J. Mayo, P. Pebay, D. Roe, N. Taerat, D. Thompson, and M. Wong
In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg.
Baler: Deterministic, lossless log message clustering tool
N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun
In: Computer Science - Research and Development, Volume 26, Numbers 3-4, pp. 285-295 (2011)
2010
OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|10 New Orleans, LA, Nov 2010.
Scalable HPC Monitoring and Analysis for Understanding and Automated Response – Invited Presentation
OVIS 3.2 User’s Guide – (NB: Deprecated)
J. Brandt, A. Gentile, C. Houf, J. Mayo, P. Pebay, D. Roe, D. Thompson, and M. Wong
SAND 2010-7109, Sandia National Laboratories, Oct 2010.
Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis
Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis – Invited Presentation
Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases
P. Pébay, D. Thompson, and J. Bennett
A Framework for Graph-Based Synthesis, Analysis, and Visualization of HPC Cluster Job Data
J. Brandt, V. De Sapio, A. Gentile, P. Kegelmeyer, J. Mayo, P. Pebay, D. Roe, D. Thompson, and M. Wong
SAND2010-2400, Sandia National Laboratories, August 2010.
The OVIS analysis architecture – (NB: Deprecated)
J. M. Brandt, V. De Sapio, A. C. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. H. Wong
Sandia Report SAND2010-5107, Sandia National Laboratories, July 2010.
The Python command line interface to the OVIS analysis functionality – (NB: Deprecated)
J. M. Brandt, A. C. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. H. Wong
Sandia Report SAND2010-4289, Sandia National Laboratories, June 2010.
Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), Chicago, IL, USA, 2010, pp. 2-7
Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Combining Virtualization, Resource Characterization, and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), Atlanta, GA, USA, 2010, pp. 1-8
Scalable Information Fusion for Fault Tolerance in Large-Scale HPC – Minisymposium Presentation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Minisymposium on Vertically Integrated Fault Tolerance for Large-Scale Scientific Computing
at the SIAM Conf. on Parallel Processing and Scientific Computing (PP10), Seattle, WA. Feb 2010.
2009
OVIS in HPC: Information Fusion for Resilience
Failure Prediction and Resilience in Large-Scale HPC Platforms
SC|09 Portland, OR, November 2009.
Advanced ParaView Visualization
K. Moreland, J. Ahrens, D. DeMarle, D. Thompson, P. Pébay and N. Fabian
Peer-reviewed tutorial on the use of statistics engines at IEEE VisWeek 2009, Atlantic City, NJ. October 2009.
Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box – Invited Presentation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing Environments
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
OVIS 2.0 User’s Guide – (NB: Deprecated)
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND 2009-2329, Sandia National Laboratories, April 2009
OVIS: Scalable Real-time Analysis of Very Large Datasets
Overview viewgraph. 2009.
2008
OVIS-2: Whole System Monitoring and Analysis - Toward Understanding and Prediction
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
SC|08 Austin, TX. November 2008.
Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing – Invited Presentation
H. Adalsteinsson, J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pebay, D. Thompson, and M. Wong
Workshop on Resiliency for Petascale HPC
OVIS: Scalable, Real-time Statistical Analysis of Very Large Datasets
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
2008 Sandia Workshop on Data Mining and Data Analysis
Extended abstract, SAND Report 2008-6109, Sandia National Laboratories, September 2008.
Using Probabilistic Characterization to Reduce Runtime Faults on HPC Systems
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
OVIS-2: A Robust Distributed Architecture for Scalable RAS
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong