OVIS Publications

Citing LDMS

To reference LDMS, please use:
The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker
SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
New Orleans, LA, USA, 2014, pp. 154-165
BibTeX:
@INPROCEEDINGS{7013000,
author={Agelastos, Anthony and Allan, Benjamin and Brandt, Jim and Cassella, Paul and Enos, Jeremy and Fullop, Joshi and Gentile, Ann and Monk, Steve and Naksinehaboon, Nichamon and Ogden, Jeff and Rajan, Mahesh and Showerman, Michael and Stevenson, Joel and Taerat, Narate and Tucker, Tom},
booktitle={SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
title={The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications},
year={2014},
volume={},
number={},
pages={154-165},
doi={10.1109/SC.2014.18}}

Publications and Selected Presentations

Note: Publications prior to Sept 2011 refer to a different, now-deprecated architecture for data collection and transport (i.e., they do NOT use LDMS).

2023

Prodigy: Toward Unsupervised Anomaly Detection in Production HPC Systems
Burak Aksar, Efe Sencan, Benjamin Schwaller, Omar Aaziz, Vitus J. Leung, Jim Brandt, Brian Kulis, Manuel Egele, and Ayse K. Coskun
SC ’23: Proc. of the Int’l Conf. for High Performance Computing, Networking, Storage and Analysis.
Nov 2023, to appear.
Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations
Francieli Boito, Jim Brandt, Valeria Cardellini, Philip Carns, Florina M. Ciorba, Hilary Egan, Ahmed Eleliemy, Ann Gentile, Thomas Gruber, Jeff Hanson, Utz-Uwe Haus, Kevin Huck, Thomas Ilsche, Thomas Jakobsche, Terry Jones, Sven Karlsson, Abdullah Mueen, Michael Ott, Tapasya Patki, Krishnan Raghavan, Stephen Simms, Kathleen Shoga, Michael Showerman, Devesh Tiwari, Torsten Wilde, Ivy Peng, and Keiji Yamamoto
Proc. IEEE Cluster
Oct 2023, to appear
Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers
B. Aksar, E. Sencan, B. Schwaller, V. Leung, J. Brandt, B. Kulis, M. Egele, and A. Coskun
AI4Sys ’23: Proceedings of the First Workshop on AI for Systems, August 2023, pp. 1–6
Evaluating and Influencing Extreme-Scale Monitoring Implementations
Jim Brandt, Chris Morrone, Eric Roman, Ann Gentile, Tom Tucker, Jeff Hanson, Kathleen Shoga, and Alec Scott
Proc Cray User’s Group
May 2023.
Driving HPC Operations With Holistic Monitoring and Operational Data Analytics (Dagstuhl Seminar 23171)
Jim Brandt, Florina Ciorba, Ann Gentile, Michael Ott, and Torsten Wilde
In Dagstuhl Reports, Volume 13, Issue 4, pp. 98-120, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
Community Readiness and Opportunities for Progress in HPC Monitoring, Analysis, Feedback, and Response – Keynote
J. Brandt
Apr 2023

2022

ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems
B. Aksar, E. Sencan, B. Schwaller, O. Aaziz, V. Leung, J. Brandt, B. Kulis, M. Egele, and A. Coskun
2022 IEEE International Conference on Cluster Computing (CLUSTER), Heidelberg, Germany, 2022, pp. 369-380
Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling
A. Goponenko, K. Lamar, C. Peterson, B. Allan, J. Brandt, and D. Dechev
2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Bordeaux, France, 2022, pp. 241-252

2021

Systematically Inferring I/O Performance Variability by Examining Repetitive Job Behavior
E. Costa, T. Patel, B. Schwaller, J. Brandt, D. Tiwari
SC ’21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2021, Article No.: 33, Pages 1–15
Backfilling HPC Jobs with a Multimodal-Aware Predictor
K. Lamar, A. Goponenko, C. Peterson, B. Allan, J. Brandt, and D. Dechev
2021 IEEE International Conference on Cluster Computing (CLUSTER), Portland, OR, USA, 2021, pp. 618-622
Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation
Y. Zhang, B. Aksar, O. Aaziz, B. Schwaller, J. Brandt, V. Leung, M. Egele, and A. Coskun
2021 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 2021, pp. 1-7
Integrating Systems Operations into CoDesign – Keynote
Presented by A. Gentile
A. Patke, S. Jha, H. Qui, J. Brandt, A. Gentile, J. Greenseid, A. Kalbarczyk, and R. Iyer
ICS ’21: Proceedings of the ACM International Conference on Supercomputing, June 2021, Pages 342–353
Presented by A. Gentile
2021 ECP Annual Meeting Center and Application Monitoring WG. Apr 2021.

2020

HPC System Data Pipeline to Enable Meaningful Insights through Analytic-Driven Visualizations
B. Schwaller, N. Tucker, T. Tucker, B. Allan, and J. Brandt
2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 433-441
Towards Workload-Adaptive Scheduling for HPC Clusters
A. Goponenko, R. Izadpanah, J. Brandt, and D. Dechev
2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 449-453
LDMS Monitoring of EDR InfiniBand Networks – workshop work-in-progress paper & presentation
B. Allan, M. Aguilar, B. Schwaller, S. Langer
2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, 2020, pp. 459-463
Also as Sandia Technical Report SAND2020-8534C (paper) and SAND2020-9599C (presentation).
Inspecting fast commodity RDMA network performance on production systems with LDMS – Workshop presentation
B. Allan, M. Aguilar, B. Schwaller, S. Langer
Sandia Technical report SAND2020-8014C.
Production LDMS, genders, systemd, and the future – Workshop presentation
B. Allan
Sandia Technical report SAND2020-8015C.
LDMS packaging: Moving from tribal knowledge to community knowledge – Workshop presentation
B. Allan
Sandia Technical report SAND2020-8013C.
ALAMO: Autonomous Lightweight Allocation, Management, and Optimization
R. Brightwell, K. B. Ferreira, R. E. Grant, S. Levy, J. Lofstead, S. L. Olivier, K. T. Pedretti, A. J. Younge, A. Gentile, and J. Brandt.
In: Nichols J., Verastegui B., Maccabe A., Hernandez O., Parete-Koon S., Ahearn T. (eds) Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI.
Smoky Mountains Computational Sciences and Engineering Conference (SMC2020) Communications in Computer and Information Science, vol 1315. Springer, Cham., 2020.
S. Jha, A. Gentile, J. Brandt, A. Patke, B. Lim, G. Bauer, M. Showerman, L. Kaplan, Z. Kalbarczyk, W. Kramer, and R. Iyer
Attributing Performance Variation from Integrated Application and System Data – poster
O. Aaziz, B. Allan, J. Brandt, J. Cook, K. Devine, J. Elliott, A. Gentile, S. Olivier, K. Pedretti, and T. Tucker
Applied Computer Science Meeting, Feb 2020.

2019

Enabling Machine Learning-based HPC Performance Diagnostics in Production Environments – Panel Organizer
Organizers: M. Showerman, J. Greenseid, A. Gentile, and J. Brandt
Panelists: W. T. Kramer (NCSA), R. Gerber (NERSC), N. Brown (EPCC), and A. Saxton (NCSA)
SC19, Fri Nov 22, 2019, 8:30 AM
Holistic Measurement Driven System Assessment (HMDSA) – poster
S. Jha, M. Showerman, A. Saxton, J. Enos, G. Bauer, Z. Kalbarczyk, A. Gentile, J. Brandt, R. Iyer, and W. T. Kramer
SC19, Nov 2019.
A Machine Learning Approach to Understanding HPC Application Performance Variation – poster
B. Aksar, B. Schwaller, O. Aaziz, E. Ates, J. Brandt, A. K. Coskun, M. Egele, and V. Leung
SC19, Nov 2019.
LDMS v4: Writing Sampler and Store Plugins
A. Gentile
Sandia National Laboratories, SAND2019-12858 O, Oct 2019.
Figures of merit for production HPC
B. Allan
Sandia National Laboratories, SAND2019-12564, Oct. 2019.
Proxy or Imposter? A Method and Case Study to Determine the Answer
O. Aaziz, J. Cook, C. Vaughan, and D. Richards
2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, NM, USA, 2019, pp. 1-9
Standardized Environment for Monitoring Heterogeneous Architectures
C. Brown, B. Schwaller, N. Gauntt, B. Allan and K. Davis
2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, NM, USA, 2019, pp. 1-5
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
S. Jha, A. Patke, J. Brandt, A. Gentile, M. Showerman, E. Roman, Z. Kalbarczyk, and R. Iyer
2019 IEEE Symposium on High-Performance Interconnects (HOTI), Santa Clara, CA, USA, 2019, pp. 45-48
B. Allan
Sandia National Laboratories, SAND2019-10266C, Aug. 2019.
HPAS: An HPC Performance Anomaly Suite for Reproducing Performance Variations
E. Ates, Y. Zhang, B. Aksar, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun
ICPP ’19: Proceedings of the 48th International Conference on Parallel Processing, August 2019, Article No.: 40, Pages 1–10
Production Application Performance Data Streaming for System Monitoring
R. Izadpanah, B. Allan, D. Dechev, and J. Brandt
ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS). Vol 4 Issue 2, Article No.: 8, pp 1–25, 2019
Exploring New Monitoring and Analysis Capabilities on Cray’s Software Preview System
J. Brandt, C. Brown, S. Donoho, A. Gentile, J. Greenseid, W. Kramer, P. Langer, A. Rashid, K. Rehm, and M. Showerman
Extracting Actionable System-Application Performance Factors
J. Brandt, A. Gentile, and J. Cook
Minisymposium on Modeling Resource Utilization and Contention in HPC System-Application Interactions – Minisymposium Organizer
Holistic Measurement Driven System Assessment (HMDSA) – poster
Bill Kramer, Greg Bauer, Brett Bode, Mike Showerman, Jeremy Enos, Aaron Saxton, Saurabh Jha, Zbigniew Kalbarczyk, and Ravishankar Iyer (NCSA/UIUC) and James Brandt and Ann Gentile (SNL)
Two Weeks In The Life of Skybridge – SLURM and LDMS metrics and metadata.
B. Allan
Sandia National Laboratories SAND 2019-4915, April 2019.

2018

Platform Independent Run Time HPC Monitoring, Analysis, and Feedback at Any-Scale – Featured Presentation at DOE Booth
J. Brandt
SC18, Nov 2018.
Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights – BoF Session Organizer
SC18, Nov 2018.
An Efficient Latch-free Database Index Based on Multi-dimensional Lists
K. Lamar, R. Izadpanah, J. Brandt, and D. Dechev
2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC), Orlando, FL, USA, 2018, pp. 1-2
Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. Leung, M. Egele, and A. Coskun
IEEE Transactions on Parallel and Distributed Systems
A Methodology for Characterizing the Correspondence Between Real and Proxy Applications
O. Aaziz, J.M. Cook, J. Cook, T. Juedeman, D. Richards, and C. Vaughan
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 190-200
Large-Scale System Monitoring Experiences and Recommendations – Invited Peer-Reviewed Submission at HPCMASPA
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, M. Gienger, J. Greenseid, A. Greiner, B. Hadri, Y. (Helen) He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 532-542
Characterizing Supercomputer Traffic Networks Through Link-Level Analysis
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, and R. Iyer
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 562-570
Modeling Expected Application Runtime for Characterizing and Assessing Job Performance
O. Aaziz, J. Cook, and M. Tanash
2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, 2018, pp. 543-551
Taxonomist: Application Detection through Rich Monitoring Data – Best Artifact Award
E. Ates, O. Tuncer, A. Turk, V. J. Leung, J. Brandt, M. Egele and A. K. Coskun
Euro-Par 2018: Parallel Processing: 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27 - 31, 2018, Pages 92–105
Integrating Low-latency Analysis into HPC System Monitoring
R. Izadpanah, N. Naksinehaboon, J. Brandt, A. Gentile, and D. Dechev
ICPP ’18: Proceedings of the 47th International Conference on Parallel Processing, August 2018, Article No.: 5, Pages 1–10
Cray System Monitoring: Successes, Requirements, Priorities
V. Ahlgren, S. Andersson, J. Brandt, N. P. Cardo, S. Chunduri, J. Enos, P. Fields, A. Gentile, R. Gerber, J. Greenseid, A. Greiner, B. Hadri, Y. He, D. Hoppe, U. Kaila, K. Kelly, M. Klein, A. Kristiansen, S. Leak, M. Mason, K. Pedretti, J-G. Piccinali, J. Repik, J. Rogers, S. Salminen, M. Showerman, C. Whitney, and J. Williams. (Authors representing ALCF, CSC, CSCS, HLRS, KAUST, LANL, NCSA, NERSC, ORNL, SNL, and Cray)
Proc. Cray Users Group (CUG), Stockholm, Sweden. May 2018.
Supporting Failure Analysis with Discoverable, Annotated Log Datasets
S. Leak, A. Greiner, A. Gentile, and J. Brandt
Proc. Cray Users Group (CUG), Stockholm, Sweden. May 2018.
Automated Analysis and Effective Feedback – BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile
Runtime HPC System and Application Performance Assessment and Diagnostics
J. Brandt, A. Gentile, Jon Cook, B. Allan, Jeanine Cook, O. Aaziz, T. Tucker, N. Naksinehaboon, N. Taerat, E. Ates, O. Tuncer, M. Egele, A. Turk, and A. Coskun
Conference on Data Analysis (CODA), Santa Fe, NM, March 2018.
Continuous Performance Tracking for Kokkos using LDMS
J. Brandt, S. Hammond, T. Tucker, A. Gentile, and J. Cook
Programming Models and CoDesign Meeting, Albuquerque, NM. Feb 2018.

2017

Systems Monitoring Data in Action – BoF Session Organizer
SC17, 12:15pm-1:15pm Thurs Nov 16 2017.
Holistic Measurement Driven System Assessment
S. Jha, J. Brandt, A. Gentile, Z. Kalbarczyk, G. Bauer, J. Enos, M. Showerman, L. Kaplan, B. Bode, A. Greiner, A. Bonnie, M. Mason, R. Iyer, and W. Kramer
2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, USA, 2017, pp. 797-800
Diagnosing Performance Variations in HPC Applications Using Machine Learning – Gauss Award Winner
O. Tuncer, E. Ates, Y. Zhang, A. Turk, J. Brandt, V. J. Leung, M. Egele, and A. K. Coskun
High Performance Computing: 32nd International Conference, ISC High Performance 2017, Frankfurt, Germany, June 18–22, 2017, Pages 355–373
LDMS Version 3 Tutorial and Demo Material – (NB: Deprecated)
J. Brandt, T. Tucker, A. Gentile, N. Naksinehaboon, and N. Taerat
Sandia National Laboratories, SAND2017-5153 O, May 2017.
Understanding Fault Scenarios and Impacts Through Fault Injection Experiments in Cielo
V. Formicola, S. Jha, F. Deng, D. Chen (UIUC), A. Bonnie, M. Mason (LANL), J. Brandt, A. Gentile (SNL), L. Kaplan, J. Repik (Cray), J. Enos, M. Showerman (NCSA), A. Greiner (NERSC), Z. Kalbarczyk, R. Iyer, and W. Kramer (UIUC)
Runtime Collection and Analysis of System Metrics for Production Monitoring of Trinity Phase II
A. DeConinck, H. Nam, D. Morton, A. Bonnie, C. Lueninghoener (LANL), J. Brandt, A. Gentile, K. Pedretti, A. Agelastos, C. Vaughan, S. Hammond, B. Allan (SNL), M. Davis and J. Repik (Cray)
Holistic Systems Monitoring and Analysis – BOF Session Organizer
M. Showerman, J. Brandt, and A. Gentile
Contention and Congestion: Challenges and Approaches to Understanding Application Impact
A. Gentile, J. Brandt, A. Agelastos, J. Lamb, K. Ruggirello, and J. Stevenson

2016

SC16, Fri 18th Nov 2016 10:30-noon.
Monitoring Large Scale HPC Systems: Understanding, Diagnosis and Attribution of Performance Variation and Issues – BoF Session Organizer
SC16, 5:15pm-7pm Wed Nov 16 2016.
Discovery, Interpretation, and Communication of Meaningful Information in HPC Monitoring Data
Holistic Measurement Driven Resilience
Chaos Community Day Seattle, WA. Aug. 2016.
Continuous Whole-System Monitoring Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
Large-Scale Persistent Numerical Data Source Monitoring System Experiences
J. Brandt, A. Gentile, M. Showerman, J. Enos, J. Fullop, and G. Bauer
2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Chicago, IL, USA, 2016, pp. 1711-1720
Design and Implementation of a Scalable HPC Monitoring System
S. Sanchez, A. Bonnie, G. Van Heule, C. Robinson, A. DeConinck, K. Kelly, Q. Snead, and J. Brandt
2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Chicago, IL, USA, 2016, pp. 1721-1725
Network Performance Counter Monitoring and Analysis on the Cray XC Platform
J. Brandt, E. Froese, A. Gentile, L. Kaplan, B. Allan, and E. Walsh
Proc. Cray Users Group (CUG), May 2016.
Dynamic Model Specific Register (MSR) Data Collection as a System Service
G. H. Bauer, J. Brandt, A. Gentile, A. Kot, and M. Showerman
Proc. Cray Users Group (CUG), May 2016.
A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, and M. Mason (LANL), J. Brandt, A. Gentile, B. Allan, and A. Agelastos (SNL), M. Davis and M. Berry (Cray)
Proc. Cray Users Group (CUG), May 2016.
M. Showerman, J. Brandt, and A. Gentile
Proc. Cray Users Group (CUG), May 2016.
Smart HPC Centers: Data, Analysis, Feedback, and Response
J. Brandt, A. Gentile, C. Martin, B. Allan, and K. Devine
Monitoring High Speed Network Fabrics: Experiences and Needs
J. Brandt, A. Gentile, B. Allan, S. Lefantzi, and M. Aguilar
at Open Fabrics Alliance Workshop, Monterey, CA. Apr 2016.
Monitoring Large Scale HPC Platforms: Issues, Approaches, and Experiences

2015

LDMS Demo at DOE Booth SC15 Nov 2015.
Monitoring Large-Scale HPC Systems: Data Analytics and Insights – BOF Session Organizer
Infrastructure for In Situ System Monitoring and Application Data Analysis
J. Brandt, K. Devine, and A. Gentile
ISAV 2015: Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, November 2015, Pages 36–40
New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup
J. Brandt, A. Gentile, C. Martin, J. Repik, and N. Taerat
2015 IEEE International Conference on Cluster Computing, Chicago, IL, USA, 2015, pp. 658-665
Extending LDMS to Enable Performance Monitoring in Multi-Core Applications
S. Feldman, D. Zhang, D. Dechev, and J. Brandt
2015 IEEE International Conference on Cluster Computing, Chicago, IL, USA, 2015, pp. 717-720
Toward Rapid Understanding of Production HPC Applications and Systems
A. Agelastos, B. Allan, J. Brandt, A. Gentile, S. Lefantzi, S. Monk, J. Ogden, M. Rajan, and J. Stevenson
2015 IEEE International Conference on Cluster Computing, Chicago, IL, USA, 2015, pp. 464-473
Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity – Best Paper Finalist
J. Brandt, D. DeBonis, A. Gentile, J. Lujan, C. Martin, D. Martinez, S. Olivier, K. Pedretti, N. Taerat, and R. Velarde
Proc. Cray User’s Group (CUG), Chicago, IL. April 2015.
Scalable Integrated High-Fidelity Continuous Monitoring
at System Monitoring of Cray Systems BoF
Proc. Cray User’s Group (CUG), Chicago, IL. April 2015.
Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping – Minisymposium Presentation
J. Brandt, K. Devine, A. Gentile, and K. Pedretti
Minisymposium on Topology Mapping and Locality

2014

Extreme-scale HPC Monitoring
The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications
A. Agelastos, B. Allan, J. Brandt, P. Cassella, J. Enos, J. Fullop, A. Gentile, S. Monk, N. Naksinehaboon, J. Ogden, M. Rajan, M. Showerman, J. Stevenson, N. Taerat, and T. Tucker
New Orleans, LA, USA, 2014, pp. 154-165
Monitoring Large-Scale HPC Systems: Issues and Approaches – BOF Session Organizer
Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping
J. Brandt, K. Devine, A. Gentile, and K. Pedretti
Monitoring Application Resource Utilization on the Intel PHI Coprocessor – Minitalk
J. Brandt and A. Gentile

1st Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications (HPCMASPA) at IEEE Int’l. Conf. on Cluster Computing (CLUSTER), Madrid, Spain. Sept 2014.

Large Scale System Monitoring and Analysis on Blue Waters Using OVIS – Best Paper Finalist
M. Showerman, J. Enos, J. Fullop (NCSA), P. Cassella (Cray), N. Naksinehaboon, N. Taerat, T. Tucker (OGC), J. Brandt, A. Gentile, and B. Allan (SNL)
Proc. Cray User’s Group (CUG), Lugano, Switzerland. May 2014.
Large Scale HPC Monitoring
New Mexico State University, Las Cruces, NM. April 2014.

2013

J. Brandt, T. Tucker, A. Gentile, D. Thompson, V. Kuhns, and J. Repik
Proc. Cray User’s Group (CUG), Napa Valley, CA. May 2013.

2012

Filtering Log Data: Finding Needles in the Haystack
L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile
Report of Experiments and Evidence for ASC L2 Milestone 4467 - Demonstration of a Legacy Application’s Path to Exascale
B. Barrett, R. Barrett, J. Brandt, R. Brightwell, M. Curry, N. Fabian, K. Ferreira, A. Gentile, S. Hemmert, S. Kelly, R. Klundt, J. Laros, V. Leung, M. Levenhagen, G. Lofstead, K. Moreland, R. Oldfield, K. Pedretti, A. Rodrigues, D. Thompson, T. Tucker, L. Ward, J. Van Dyke, C. Vaughan, and K. Wheeler
SAND2012-1750. Sandia National Laboratories. March 2012.

2011

OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|11 Seattle, WA, November 2011.
Develop Feedback System for Intelligent Dynamic Resource Allocation to Improve Application Performance
J. Brandt, A. Gentile, D. Thompson and T. Tucker
SAND2011-6301. Sandia National Laboratories. September 2011.
Framework for Enabling System Understanding
J. Brandt, F. Chen, A. Gentile, C. Leangsuksun, J. Mayo, P. Pébay, D. Roe, N. Taerat, D. Thompson, and M. Wong
In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg.
Baler: Deterministic, lossless log message clustering tool
N. Taerat, J. Brandt, A. Gentile, M. Wong, and C. Leangsuksun
In: Computer Science - Research and Development
Volume 26, Numbers 3-4, 285-295, (2011)

2010

OVIS, Lightweight Data Metric Service (LDMS), and Log File Analysis
SC|10 New Orleans, LA, Nov 2010.
  • Exhibit ASC Booth Demos

  • Exhibit ASC Booth talk: OVIS 3: Scalable Data Collection and Analysis for Large Scale HPC System Understanding

Scalable HPC Monitoring and Analysis for Understanding and Automated Response – Invited Presentation
HPC Resilience Summit 2010: Workshop on Resilience for Exascale HPC at the Los Alamos Computer Science Symposium, Santa Fe, NM. Oct 2010.
OVIS 3.2 User’s Guide – (NB: Deprecated)
J. Brandt, A. Gentile, C. Houf, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND 2010-7109, Sandia National Laboratories, Oct 2010.
Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis
New Mexico State University, Las Cruces, NM. October 2010.
Understanding Large Scale HPC Systems Through Scalable Monitoring and Analysis – Invited Presentation
European Grid Initiative (EGI) Technical Forum 2010, Amsterdam, Netherlands. September 2010.
Computing Contingency Statistics in Parallel: Design Trade-Offs and Limiting Cases
P. Pébay, D. Thompson, and J. Bennett
2010 IEEE International Conference on Cluster Computing, Heraklion, Greece, 2010, pp. 156-165
A Framework for Graph-Based Synthesis, Analysis, and Visualization of HPC Cluster Job Data
J. Brandt, V. De Sapio, A. Gentile, P. Kegelmeyer, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND2010-2400, Sandia National Laboratories, August 2010.
The OVIS analysis architecture – (NB: Deprecated)
J. M. Brandt, V. De Sapio, A. C. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. H. Wong
Sandia Report SAND2010-5107, Sandia National Laboratories, July 2010.
The Python command line interface to the OVIS analysis functionality – (NB: Deprecated)
J. M. Brandt, A. C. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. H. Wong
Sandia Report SAND2010-4289, Sandia National Laboratories, June 2010.
Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), Chicago, IL, USA, 2010, pp. 2-7
Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Combining Virtualization, Resource Characterization, and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), Atlanta, GA, USA, 2010, pp. 1-8
Scalable Information Fusion for Fault Tolerance in Large-Scale HPC – Minisymposium Presentation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Minisymposium on Vertically Integrated Fault Tolerance for Large-Scale Scientific Computing

at the SIAM Conf. on Parallel Processing and Scientific Computing (PP10), Seattle, WA. Feb 2010.

2009

OVIS in HPC: Information Fusion for Resilience
Louisiana Tech University, Ruston, LA. December 2009.
Failure Prediction and Resilience in Large-Scale HPC Platforms
SC|09 Portland, OR, November 2009.
  • Exhibit Presentation and Demo

Advanced ParaView Visualization
K. Moreland, J. Ahrens, D. DeMarle, D. Thompson, P. Pébay and N. Fabian
peer-reviewed tutorial on the use of statistics engines at the IEEE VisWeek 2009, Atlantic City, NJ. October 2009.
Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box – Invited Presentation
J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing Environments
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
  • Note: 5th Workshop on System Management Techniques, Processes, and Services (SMTPS) - Special Focus on Cloud Computing – Best Paper Award

OVIS 2.0 User’s Guide – (NB: Deprecated)
J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong
SAND 2009-2329, Sandia National Laboratories, April 2009
OVIS: Scalable Real-time Analysis of Very Large Datasets
Overview viewgraph. 2009.

2008

OVIS-2: Whole System Monitoring and Analysis - Toward Understanding and Prediction
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
SC|08 Austin, TX. November 2008.
  • Exhibit Presentation and Demo

Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing – Invited Presentation
H. Adalsteinsson, J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
Workshop on Resiliency for Petascale HPC
at the Los Alamos Computer Science Symposium (LACSS 2008), Santa Fe, NM. October 2008.
OVIS: Scalable, Real-time Statistical Analysis of Very Large Datasets
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
2008 Sandia Workshop on Data Mining and Data Analysis
Extended abstract, SAND Report 2008-6109, Sandia National Laboratories, September 2008.
Using Probabilistic Characterization to Reduce Runtime Faults on HPC Systems
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
OVIS-2: A Robust Distributed Architecture for Scalable RAS
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong

2007

OVIS-2: A Distributed Framework for Scalable Monitoring and Analysis of Large Computational Clusters
J. Brandt, B. Debusschere, A. Gentile, J. Mayo, P. Pébay, D. Thompson, and M. Wong
SC|07 Reno, NV, November 2007.
  • Exhibit Presentation and Demo

2006

Monitoring Computational Clusters with OVIS
J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong
SAND Report 2006-7939, Sandia National Laboratories, December 2006.
OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, J. Ortega, P. P. Pébay, D. C. Thompson, and M. H. Wong
SC|06 Tampa, FL, November 2006.
  • Exhibit Presentation and Demo

OVIS: A Tool for Intelligent, Real-Time Monitoring of Computational Clusters
Distributed, Intelligent RAS System for Large Computational Clusters: FactSheet
J. M. Brandt, A. C. Gentile, P. P. Pébay and M. H. Wong
Fact sheet, Sandia National Laboratories, April 2006.

2005

Bayesian Inference for Intelligent, Real-time Monitoring of Computational Clusters
J. M. Brandt, A. C. Gentile, D. J. Hale, Y. M. Marzouk, and P. P. Pébay
SC|05 Seattle, Washington, November 2005.
  • Exhibit Presentation, Demo, and Flier

  • Conference Poster

Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay
Meaningful Automated Statistical Analysis of Large Computational Clusters
J. M. Brandt, A. C. Gentile, Y. M. Marzouk, and P. P. Pébay
SAND Report 2005-4558, Sandia National Laboratories, July 2005.

2004

Detection of System Abnormalities Through Behavioral Analysis of ASC Codes
J. M. Brandt and A. C. Gentile
SC|04 Exhibit, Pittsburgh, PA, November 2004.
  • Exhibit Demo

2003

Distributed Intelligent RAS System for Large Computational Clusters
J. M. Brandt, N. M. Berry, R. A. Yao, B. M. Tsudama, and A. C. Gentile
SC|03, Phoenix, AZ, November 2003.
  • Exhibit Demo

  • Conference Poster

Dataset Releases - HMDR

The ASCR-funded exascale resilience project "Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection and Impact" (HMDR) released the following system datasets in support of resilience research:

2019

Cielo Fault Injection Dataset 2016
S. Jha, V. Formicola, A. Bonnie, M. Mason, D. Chen, F. Deng, A. Gentile, J. Brandt, L. Kaplan, J. Repik, J. Enos, M. Showerman, A. Greiner, Z. Kalbarczyk, R. Iyer, and W. Kramer.
LA-UR-19-22749, SAND2019-3531 O, Mar 2019.

2016

Mutrino Dataset 2/15-6/16 (12/16 Release) (About)
J. Brandt, A. Gentile, and J. Repik
SAND2016-12310 O, Dec 2016
Mutrino Dataset 2/15-5/15 (About)
J. Brandt, A. Gentile, and J. Repik
SAND2016-2449 O, Mar 2016