LDMS Quick Start
Installation
AlmaLinux 8
Prerequisites
GNU compiler
swig
autoconf
libtool
readline
readline-devel
libevent
libevent-devel
autogen-libopts
gettext
python3.8
python38-Cython
python38-libs
glib2-devel
git
bison
make
byacc
flex
Prerequisite Installation
The following steps were run on AlmaLinux 8 arm64v8
sudo dnf update -y
sudo dnf install -y openssl
sudo dnf install -y openssl-devel
sudo dnf install -y swig
sudo dnf install -y libtool
sudo dnf install -y readline
sudo dnf install -y readline-devel
sudo dnf install -y libevent
sudo dnf install -y libevent-devel
sudo dnf install -y autogen-libopts
sudo dnf install -y gettext
sudo dnf install -y glib2
sudo dnf install -y glib2-devel
sudo dnf install -y git
sudo dnf install -y bison
sudo dnf install -y make
sudo dnf install -y byacc
sudo dnf install -y flex
sudo dnf install -y python38
sudo dnf install -y python38-devel
sudo dnf install -y python38-Cython
sudo dnf install -y python38-libs
RHEL 9
Prerequisites
RHEL 9
openssl-devel
pkg-config
automake
libtool
python3 (or higher)
python3-devel (or higher)
cython
bison
flex
Prerequisite Installation
The following steps were run on a basic RHEL 9 instance on AWS.
sudo yum update -y
sudo yum install automake -y
sudo yum install openssl-devel -y
sudo yum install pkg-config -y
sudo yum install libtool -y
sudo yum install python3 -y
sudo yum install python3-devel.x86_64 -y
sudo yum install python3-Cython -y
sudo yum install make -y
sudo yum install bison -y
sudo yum install flex -y
LDMS Source Installation Instructions
Getting the Source
mkdir $HOME/Source
mkdir $HOME/ovis
cd $HOME/Source
git clone -b OVIS-4.4.2 https://github.com/ovis-hpc/ovis.git ovis-4
Building the Source
Run autogen.sh
cd $HOME/Source/ovis-4
./autogen.sh
Configure and build (this builds the default Linux samplers; the installation directory is set by the prefix):
mkdir build
cd build
../configure --prefix=$HOME/ovis/4.4.2
make
make install
Basic Configuration and Running
Set up environment:
OVIS=$HOME/ovis/4.4.2
export LD_LIBRARY_PATH=$OVIS/lib:$LD_LIBRARY_PATH
export LDMSD_PLUGIN_LIBPATH=$OVIS/lib/ovis-ldms
export ZAP_LIBPATH=$OVIS/lib/ovis-ldms
export PATH=$OVIS/sbin:$OVIS/bin:$PATH
export PYTHONPATH=$OVIS/lib/python3.8/site-packages
Sampler
Create a new configuration file, named sampler.conf, that loads the meminfo and vmstat samplers. For this example it can be saved anywhere; it will be used later to start the LDMS daemon (ldmsd).
The example configuration shown below (after the interval table) employs generic hostname, uid, gid, component id, and permissions octal set values.
Sampling intervals are set in microseconds (1 sec = 1e+6 µs) and are adjustable as needed. Some suggestions include:
Sampler     | Seconds (sec) | Microseconds (µs)
------------|---------------|------------------
Power       | 0.1 sec       | 100000 µs
Meminfo     | 1.0 sec       | 1000000 µs
VMstat      | 10 sec        | 10000000 µs
Link Status | 60 sec        | 60000000 µs
Note
Sampling offset is typically set to 0 for sampler plugins.
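A minimal sampler.conf along these lines might look like the following sketch (the producer and instance names, component id, uid/gid, permission octal, and intervals are illustrative placeholders; the intervals follow the table above):
load name=meminfo
config name=meminfo producer=host1 instance=host1/meminfo component_id=1 schema=meminfo uid=0 gid=0 perm=0755
start name=meminfo interval=1000000 offset=0
load name=vmstat
config name=vmstat producer=host1 instance=host1/vmstat component_id=1 schema=vmstat uid=0 gid=0 perm=0755
start name=vmstat interval=10000000 offset=0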
As an alternative to hard-coding values as in the configuration above, one may instead export environment variables and reference them in the sampler configuration file.
The following exports set the samplers to collect at 1-second (i.e., 1000000 µs) intervals:
export HOSTNAME=${HOSTNAME:=$(hostname -s)} # typically already set; set it if not
export COMPONENT_ID=1
export SAMPLE_INTERVAL=1000000
export SAMPLE_OFFSET=50000
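A sampler.conf that references these variables might then look like the following sketch (ldmsd expands ${VAR} from its environment; producer, instance, and schema names mirror the example above):
load name=meminfo
config name=meminfo producer=${HOSTNAME} instance=${HOSTNAME}/meminfo component_id=${COMPONENT_ID} schema=meminfo
start name=meminfo interval=${SAMPLE_INTERVAL} offset=${SAMPLE_OFFSET}
load name=vmstat
config name=vmstat producer=${HOSTNAME} instance=${HOSTNAME}/vmstat component_id=${COMPONENT_ID} schema=vmstat
start name=vmstat interval=${SAMPLE_INTERVAL} offset=${SAMPLE_OFFSET}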
Run a daemon using munge authentication:
ldmsd -x sock:10444 -c sampler.conf -l /tmp/demo_ldmsd.log -v DEBUG -a munge -r $(pwd)/ldmsd.pid
Or in non-cluster environments where munge is unavailable:
ldmsd -x sock:10444 -c sampler.conf -l /tmp/demo_ldmsd.log -v DEBUG -r $(pwd)/ldmsd.pid
Note
For the rest of these instructions, omit “-a munge” if you do not have munge running. The -v DEBUG option writes DEBUG-level information to the log file specified with -l.
Run ldms_ls on that node to see set, meta-data, and contents:
ldms_ls -h localhost -x sock -p 10444 -a munge
ldms_ls -h localhost -x sock -p 10444 -v -a munge
ldms_ls -h localhost -x sock -p 10444 -l -a munge
Note
Note the use of munge: a daemon launched with munge authentication cannot be queried without munge, and ldms_ls will only show the sets that the querying user's permissions allow.
Example (note permissions and update hint):
ldms_ls -h localhost -x sock -p 10444 -l -v -a munge
Output:
host1/vmstat: consistent, last update: Mon Oct 22 16:58:15 2018 -0600 [1385us]
APPLICATION SET INFORMATION ------
updt_hint_us : 5000000:0
METADATA --------
Producer Name : host1
Instance Name : host1/vmstat
Schema Name : vmstat
Size : 5008
Metric Count : 110
GN : 2
User : root(0)
Group : root(0)
Permissions : -rwxr-xr-x
DATA ------------
Timestamp : Mon Oct 22 16:58:15 2018 -0600 [1385us]
Duration : [0.000106s]
Consistent : TRUE
Size : 928
GN : 110
-----------------
M u64 component_id 1
D u64 job_id 0
D u64 app_id 0
D u64 nr_free_pages 32522123
...
D u64 pglazyfree 1082699829
host1/meminfo: consistent, last update: Mon Oct 22 16:58:15 2018 -0600 [1278us]
APPLICATION SET INFORMATION ------
updt_hint_us : 5000000:0
METADATA --------
Producer Name : host1
Instance Name : host1/meminfo
Schema Name : meminfo
Size : 1952
Metric Count : 46
GN : 2
User : myuser(12345)
Group : myuser(12345)
Permissions : -rwx------
DATA ------------
Timestamp : Mon Oct 22 16:58:15 2018 -0600 [1278us]
Duration : [0.000032s]
Consistent : TRUE
Size : 416
GN : 46
-----------------
M u64 component_id 1
D u64 job_id 0
D u64 app_id 0
D u64 MemTotal 131899616
D u64 MemFree 130088492
D u64 MemAvailable 129556912
...
D u64 DirectMap1G 134217728
Aggregator Using Data Pull
Start another sampler daemon with a similar configuration on host2 using component_id=2, as above.
Make a configuration file (called agg11.conf) to aggregate from the two samplers at different intervals with the following contents:
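# agg11.conf (a sketch; host names and ports follow the examples above, and the 1-second and 2-second update intervals are illustrative choices showing two different rates)
prdcr_add name=host1 host=host1 type=active xprt=sock port=10444 interval=20000000
prdcr_start name=host1
prdcr_add name=host2 host=host2 type=active xprt=sock port=10444 interval=20000000
prdcr_start name=host2
updtr_add name=policy_h1 interval=1000000 offset=100000
updtr_prdcr_add name=policy_h1 regex=host1
updtr_start name=policy_h1
updtr_add name=policy_h2 interval=2000000 offset=100000
updtr_prdcr_add name=policy_h2 regex=host2
updtr_start name=policy_h2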
On host3, set up the environment as above and run a daemon:
ldmsd -x sock:10445 -c agg11.conf -l /tmp/demo_ldmsd.log -v ERROR -a munge
Run ldms_ls on the aggregator node to see set listing:
ldms_ls -h localhost -x sock -p 10445 -a munge
Output:
host1/meminfo
host1/vmstat
host2/meminfo
host2/vmstat
You can also run ldms_ls to query the ldms daemon on the remote node:
ldms_ls -h host1 -x sock -p 10444 -a munge
Output:
host1/meminfo
host1/vmstat
Note
ldms_ls -l shows the detailed output, including timestamps. This can be used to verify that the aggregator is aggregating the two hosts’ sets at different intervals.
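For example, on the aggregator (assuming munge), the per-set update timestamps can be listed with:
ldms_ls -h localhost -x sock -p 10445 -l -a munge | grep "last update"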
Aggregator Using Data Push
Use the same sampler configurations as above.
Make a configuration file (called agg11_push.conf) to cause the two samplers to push their data to the aggregator as they update.
Note that the prdcr configs remain the same as above, but the updtr_add includes the additional options push=onchange and auto_interval=false.
Note that the updtr_add interval has no effect in this case but is currently required due to syntax checking.
prdcr_add name=host1 host=host1 type=active xprt=sock port=10444 interval=20000000
prdcr_start name=host1
prdcr_add name=host2 host=host2 type=active xprt=sock port=10444 interval=20000000
prdcr_start name=host2
updtr_add name=policy_all interval=5000000 push=onchange auto_interval=false
updtr_prdcr_add name=policy_all regex=.*
updtr_start name=policy_all
On host3, set up the environment as above and run a daemon:
ldmsd -x sock:10445 -c agg11_push.conf -l /tmp/demo_ldmsd_log -v DEBUG -a munge
Run ldms_ls on the aggregator node to see set listing:
ldms_ls -h localhost -x sock -p 10445 -a munge
Output:
host1/meminfo
host1/vmstat
host2/meminfo
host2/vmstat
Two Aggregators Configured as Failover Pairs
Use the same sampler configurations as above.
Make a configuration file (called agg11.conf) to aggregate from one sampler with the following contents:
prdcr_add name=host1 host=host1 type=active xprt=sock port=10444 interval=20000000
prdcr_start name=host1
updtr_add name=policy_all interval=1000000 offset=100000
updtr_prdcr_add name=policy_all regex=.*
updtr_start name=policy_all
failover_config host=host3 port=10446 xprt=sock type=active interval=1000000 peer_name=agg12 timeout_factor=2
failover_start
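The second aggregator needs a corresponding configuration file (agg12.conf); a sketch that mirrors agg11.conf, pulling from host2 and pointing its failover peer back at agg11 on port 10445 (these values are assumptions consistent with the example above), would be:
prdcr_add name=host2 host=host2 type=active xprt=sock port=10444 interval=20000000
prdcr_start name=host2
updtr_add name=policy_all interval=1000000 offset=100000
updtr_prdcr_add name=policy_all regex=.*
updtr_start name=policy_all
failover_config host=host3 port=10445 xprt=sock type=active interval=1000000 peer_name=agg11 timeout_factor=2
failover_start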
On host3, set up the environment as above and run two daemons as follows:
ldmsd -x sock:10445 -c agg11.conf -l /tmp/demo_ldmsd_log -v ERROR -n agg11 -a munge
ldmsd -x sock:10446 -c agg12.conf -l /tmp/demo_ldmsd_log -v ERROR -n agg12 -a munge
Run ldms_ls on each aggregator node to see set listing:
ldms_ls -h localhost -x sock -p 10445 -a munge
host1/meminfo
host1/vmstat
ldms_ls -h localhost -x sock -p 10446 -a munge
host2/meminfo
host2/vmstat
Kill one daemon:
kill -SIGTERM <pid of daemon listening on 10445>
Make sure it has exited.
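For example, confirm the process is gone (substituting the same PID used above):
ps -p <pid of daemon listening on 10445>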
Run ldms_ls on the remaining aggregator to see set listing:
ldms_ls -h localhost -x sock -p 10446 -a munge
Output:
host1/meminfo
host1/vmstat
host2/meminfo
host2/vmstat
Set Groups
A set group is an LDMS set with special information that lets ldmsd treat it as a group of sets. A set group appears as a regular LDMS set to other LDMS applications, but ldmsd and ldms_ls treat it as a collection of LDMS sets: if an ldmsd updater updates a set group, it also updates all of the member sets, and performing ldms_ls -l on a set group also performs the long query on all of the sets in the group.
To illustrate how a set group works, the following subsections configure 2 sampler daemons with set groups and 1 aggregator daemon that updates and stores the groups.
Creating a set group and inserting sets into it
The following is a configuration file for our s0 LDMS daemon (sampler #0) that collects sda disk stats in the s0/sda set and lo network usage in the s0/lo set. The s0/grp set group is created to contain both s0/sda and s0/lo.
### s0.conf
load name=procdiskstats
config name=procdiskstats device=sda producer=s0 instance=s0/sda
start name=procdiskstats interval=1000000 offset=0
load name=procnetdev
config name=procnetdev ifaces=lo producer=s0 instance=s0/lo
start name=procnetdev interval=1000000 offset=0
setgroup_add name=s0/grp producer=s0 interval=1000000 offset=0
setgroup_ins name=s0/grp instance=s0/sda,s0/lo
The following is the same for the s1 sampler daemon, but with different devices (sdb and eno1).
### s1.conf
load name=procdiskstats
config name=procdiskstats device=sdb producer=s1 instance=s1/sdb
start name=procdiskstats interval=1000000 offset=0
load name=procnetdev
config name=procnetdev ifaces=eno1 producer=s1 instance=s1/eno1
start name=procnetdev interval=1000000 offset=0
setgroup_add name=s1/grp producer=s1 interval=1000000 offset=0
setgroup_ins name=s1/grp instance=s1/sdb,s1/eno1
The s0 LDMS daemon is listening on port 10000 and the s1 LDMS daemon is listening on port 10001.
Perform ldms_ls on a group
Performing ldms_ls -v or ldms_ls -l against an LDMS daemon hosting a group queries the set representing the group itself as well as each of the group's members.
Example:
ldms_ls -h localhost -x sock -p 10000
Output:
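s0/grp
s0/sda
s0/lo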
ldms_ls -h localhost -x sock -p 10000 -v s0/grp | grep consistent
Output:
s0/grp: consistent, last update: Mon May 20 15:44:30 2019 -0500 [511879us]
s0/lo: consistent, last update: Mon May 20 16:13:16 2019 -0500 [1126us]
s0/sda: consistent, last update: Mon May 20 16:13:17 2019 -0500 [1176us]
ldms_ls -h localhost -x sock -p 10000 -v s0/lo | grep consistent # only query lo set from set group s0
Note
The update time of the group set is the time that the last set was inserted into the group.
Update / store with set group
The following is an example aggregator configuration that uses match rules to update only the set groups (and, through them, their members) and stores them with storage policies:
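# Producers (a sketch/assumption: both sampler daemons run on localhost, s0 listening on port 10000 and s1 on port 10001 as noted above)
prdcr_add name=s0 host=localhost type=active xprt=sock port=10000 interval=1000000
prdcr_start name=s0
prdcr_add name=s1 host=localhost type=active xprt=sock port=10001 interval=1000000
prdcr_start name=s1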
# Stores
load name=store_csv
config name=store_csv path=csv
# strgp for netdev, csv file: "./csv/net/procnetdev"
strgp_add name=store_net plugin=store_csv container=net schema=procnetdev
strgp_prdcr_add name=store_net regex=.*
strgp_start name=store_net
# strgp for diskstats, csv file: "./csv/disk/procdiskstats"
strgp_add name=store_disk plugin=store_csv container=disk schema=procdiskstats
strgp_prdcr_add name=store_disk regex=.*
strgp_start name=store_disk
# Updater that updates only groups
updtr_add name=u interval=1000000 offset=500000
updtr_match_add name=u regex=ldmsd_grp_schema match=schema
updtr_prdcr_add name=u regex=.*
updtr_start name=u
Performing ldms_ls on the LDMS aggregator daemon lists all of the sets (including the groups):
ldms_ls -h localhost -x sock -p 9000
Output:
s1/sdb
s1/grp
s1/eno1
s0/sda
s0/lo
s0/grp
Performing ldms_ls -v against the aggregator and naming just the group again queries the group and its members:
ldms_ls -h localhost -x sock -p 9000 -v s1/grp | grep consistent
Output:
s1/grp: consistent, last update: Mon May 20 15:42:34 2019 -0500 [891643us]
s1/sdb: consistent, last update: Mon May 20 16:38:38 2019 -0500 [1805us]
s1/eno1: consistent, last update: Mon May 20 16:38:38 2019 -0500 [1791us]
The following is an example of the CSV output:
> head csv/*/*
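==> csv/disk/procdiskstats <==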
#Time,Time_usec,ProducerName,component_id,job_id,app_id,reads_comp#sda,reads_comp.rate#sda,reads_merg#sda,reads_merg.rate#sda,sect_read#sda,sect_read.rate#sda,time_read#sda,time_read.rate#sda,writes_comp#sda,writes_comp.rate#sda,writes_merg#sda,writes_merg.rate#sda,sect_written#sda,sect_written.rate#sda,time_write#sda,time_write.rate#sda,ios_in_progress#sda,ios_in_progress.rate#sda,time_ios#sda,time_ios.rate#sda,weighted_time#sda,weighted_time.rate#sda,disk.byte_read#sda,disk.byte_read.rate#sda,disk.byte_written#sda,disk.byte_written.rate#sda
1558387831.001731,1731,s0,0,0,0,197797,0,9132,0,5382606,0,69312,0,522561,0,446083,0,418086168,0,966856,0,0,0,213096,0,1036080,0,1327776668,0,1380408297,0
1558387832.001943,1943,s1,0,0,0,108887,0,32214,0,1143802,0,439216,0,1,0,0,0,8,0,44,0,0,0,54012,0,439240,0,1309384656,0,1166016512,0
1558387832.001923,1923,s0,0,0,0,197797,0,9132,0,5382606,0,69312,0,522561,0,446083,0,418086168,0,966856,0,0,0,213096,0,1036080,0,1327776668,0,1380408297,0
1558387833.001968,1968,s1,0,0,0,108887,0,32214,0,1143802,0,439216,0,1,0,0,0,8,0,44,0,0,0,54012,0,439240,0,1309384656,0,1166016512,0
1558387833.001955,1955,s0,0,0,0,197797,0,9132,0,5382606,0,69312,0,522561,0,446083,0,418086168,0,966856,0,0,0,213096,0,1036080,0,1327776668,0,1380408297,0
1558387834.001144,1144,s1,0,0,0,108887,0,32214,0,1143802,0,439216,0,1,0,0,0,8,0,44,0,0,0,54012,0,439240,0,1309384656,0,1166016512,0
1558387834.001121,1121,s0,0,0,0,197797,0,9132,0,5382606,0,69312,0,522561,0,446083,0,418086168,0,966856,0,0,0,213096,0,1036080,0,1327776668,0,1380408297,0
1558387835.001179,1179,s0,0,0,0,197797,0,9132,0,5382606,0,69312,0,522561,0,446083,0,418086168,0,966856,0,0,0,213096,0,1036080,0,1327776668,0,1380408297,0
1558387835.001193,1193,s1,0,0,0,108887,0,32214,0,1143802,0,439216,0,1,0,0,0,8,0,44,0,0,0,54012,0,439240,0,1309384656,0,1166016512,0
==> csv/net/procnetdev <==
#Time,Time_usec,ProducerName,component_id,job_id,app_id,rx_bytes#lo,rx_packets#lo,rx_errs#lo,rx_drop#lo,rx_fifo#lo,rx_frame#lo,rx_compressed#lo,rx_multicast#lo,tx_bytes#lo,tx_packets#lo,tx_errs#lo,tx_drop#lo,tx_fifo#lo,tx_colls#lo,tx_carrier#lo,tx_compressed#lo
1558387831.001798,1798,s0,0,0,0,12328527,100865,0,0,0,0,0,0,12328527,100865,0,0,0,0,0,0
1558387832.001906,1906,s0,0,0,0,12342153,100925,0,0,0,0,0,0,12342153,100925,0,0,0,0,0,0
1558387832.001929,1929,s1,0,0,0,3323644475,2865919,0,0,0,0,0,12898,342874081,1336419,0,0,0,0,0,0
1558387833.002001,2001,s0,0,0,0,12346841,100939,0,0,0,0,0,0,12346841,100939,0,0,0,0,0,0
1558387833.002025,2025,s1,0,0,0,3323644475,2865919,0,0,0,0,0,12898,342874081,1336419,0,0,0,0,0,0
1558387834.001106,1106,s0,0,0,0,12349089,100953,0,0,0,0,0,0,12349089,100953,0,0,0,0,0,0
1558387834.001130,1130,s1,0,0,0,3323647234,2865923,0,0,0,0,0,12898,342875727,1336423,0,0,0,0,0,0
1558387835.001247,1247,s0,0,0,0,12351337,100967,0,0,0,0,0,0,12351337,100967,0,0,0,0,0,0
1558387835.001274,1274,s1,0,0,0,3323647298,2865924,0,0,0,0,0,12898,342875727,1336423,0,0,0,0,0,0