I'm adding another answer because I just ran into this problem in another situation, where Prometheus v2.19 was scraping metrics from PCP v5 on CentOS 7 servers via PMAPI. In the Prometheus config file the scrape was defined as a single job with several metric domains, i.e.:
- job_name: 'pcp'
  file_sd_configs:
    - files: [...]
  metrics_path: '/metrics'
  params:
    target: ['kernel', 'mem', 'disk', 'network', 'mounts', 'lustre', 'infiniband']
When there was a problem with one of the metric domains, typically lustre or infiniband because of missing hardware on the host, only the kernel metrics were collected.
The problem was solved by splitting the scrape into multiple jobs, each with only one target, for example:
- job_name: 'pcp-kernel'
  file_sd_configs:
    - files: [...]
  metrics_path: '/metrics'
  params:
    target: ['kernel']
- job_name: 'pcp-mem'
  file_sd_configs:
    - files: [...]
  metrics_path: '/metrics'
  params:
    target: ['mem']
[...]
This way, metrics from the core domains were always scraped successfully even if one or all of the extra domains failed. This setup appears to be more robust, although it makes the target status view busier because of the larger number of scrape jobs.
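For completeness, here is a minimal sketch of what the file_sd targets file referenced by the jobs above might look like; the file name, host names, and the pmproxy port (44322 by default) are assumptions rather than part of the original setup. Since every pcp-* job can point at the same file, the jobs only differ in their job_name and target parameter.

# targets/pcp.yml - hypothetical file_sd targets file shared by all pcp-* jobs
# host names and the pmproxy port are assumed example values
- targets:
    - 'node01.example.com:44322'
    - 'node02.example.com:44322'
  labels:
    service: 'pcp'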