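Production systems occasionally respond slowly (a sharp rise in RT) or show a spike in CPU usage. We are not always at the computer when this happens, so we need some monitoring scripts to preserve the scene for us.
Below are two cases: one solved with Java's own technology stack, the other with shell.
RT too high
In this situation the cause may be excessive QPS, slow SQL, blocked RPC calls, and so on. We can deploy the monitoring as a Spring Boot scheduled task, which avoids forgetting to add a separate monitoring script when scaling out.
Pure Java code
Be warned that this takes quite a lot of code.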
import java.io.OutputStream;
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;

public class JVMUtils {

    // Returns the number of threads currently in the BLOCKED state.
    public static int blockedTheadNum() {
        int count = 0;
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        long[] allThreadIds = threadMXBean.getAllThreadIds();
        ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(allThreadIds, true, true);
        for (ThreadInfo threadInfo : threadInfos) {
            // An entry is null when the thread has exited since getAllThreadIds()
            if (threadInfo != null && threadInfo.getThreadState() == Thread.State.BLOCKED) {
                count++;
            }
        }
        return count;
    }

    // Writes a jstack-like dump of all live threads to the given stream.
    public static void jstack(OutputStream stream) throws Exception {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        long[] allThreadIds = threadMXBean.getAllThreadIds();
        ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(allThreadIds, true, true);
        for (ThreadInfo threadInfo : threadInfos) {
            if (threadInfo != null) {
                stream.write(printThreadInfoDepth(threadInfo, 1000).getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    // Modeled on ThreadInfo.toString(), but prints up to `depth` stack frames.
    static String printThreadInfoDepth(ThreadInfo threadInfo, int depth) {
        StringBuilder sb = new StringBuilder("\"" + threadInfo.getThreadName() + "\"" +
                " Id=" + threadInfo.getThreadId() + " " +
                threadInfo.getThreadState());
        if (threadInfo.getLockName() != null) {
            sb.append(" on " + threadInfo.getLockName());
        }
        if (threadInfo.getLockOwnerName() != null) {
            sb.append(" owned by \"" + threadInfo.getLockOwnerName() +
                    "\" Id=" + threadInfo.getLockOwnerId());
        }
        if (threadInfo.isSuspended()) {
            sb.append(" (suspended)");
        }
        if (threadInfo.isInNative()) {
            sb.append(" (in native)");
        }
        sb.append('\n');
        // getStackTrace() returns a copy on every call, so fetch it once
        StackTraceElement[] stackTrace = threadInfo.getStackTrace();
        int i = 0;
        for (; i < stackTrace.length && i < depth; i++) {
            StackTraceElement ste = stackTrace[i];
            sb.append("\tat " + ste.toString());
            sb.append('\n');
            // The lock the thread is blocked or waiting on belongs to the top frame
            if (i == 0 && threadInfo.getLockInfo() != null) {
                Thread.State ts = threadInfo.getThreadState();
                switch (ts) {
                    case BLOCKED:
                        sb.append("\t- blocked on " + threadInfo.getLockInfo());
                        sb.append('\n');
                        break;
                    case WAITING:
                    case TIMED_WAITING:
                        sb.append("\t- waiting on " + threadInfo.getLockInfo());
                        sb.append('\n');
                        break;
                    default:
                }
            }
            // Monitors acquired in this frame
            for (MonitorInfo mi : threadInfo.getLockedMonitors()) {
                if (mi.getLockedStackDepth() == i) {
                    sb.append("\t- locked " + mi);
                    sb.append('\n');
                }
            }
        }
        if (i < stackTrace.length) {
            sb.append("\t...");
            sb.append('\n');
        }
        LockInfo[] locks = threadInfo.getLockedSynchronizers();
        if (locks.length > 0) {
            sb.append("\n\tNumber of locked synchronizers = " + locks.length);
            sb.append('\n');
            for (LockInfo li : locks) {
                sb.append("\t- " + li);
                sb.append('\n');
            }
        }
        sb.append('\n');
        return sb.toString();
    }
}
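To see what the BLOCKED count in JVMUtils actually measures, here is a small self-contained sketch. It starts a few worker threads that contend for a monitor held by the caller and then counts the threads the JVM reports as BLOCKED. The class name BlockedThreadDemo and both method names are made up for this illustration and are not part of the original code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the article's code.
final class BlockedThreadDemo {

    /** Counts live threads in the BLOCKED state, skipping threads that exited mid-query. */
    static int countBlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int count = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                count++;
            }
        }
        return count;
    }

    /** Holds a lock, lets `workers` threads block on it, and returns the BLOCKED count seen meanwhile. */
    static int blockedWhileLockHeld(int workers) {
        final Object lock = new Object();
        List<Thread> threads = new ArrayList<>();
        try {
            int observed;
            synchronized (lock) {
                for (int i = 0; i < workers; i++) {
                    Thread t = new Thread(() -> { synchronized (lock) { } }, "demo-blocked-" + i);
                    t.setDaemon(true);
                    t.start();
                    threads.add(t);
                }
                // Wait until every worker has reached the contended monitor
                for (Thread t : threads) {
                    while (t.getState() != Thread.State.BLOCKED) {
                        Thread.sleep(5);
                    }
                }
                observed = countBlocked();
            }
            // The lock is released now, so the workers can finish
            for (Thread t : threads) {
                t.join();
            }
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked threads observed: " + blockedWhileLockHeld(3));
    }
}
```

Only threads waiting to enter a synchronized block or method count as BLOCKED. Threads parked on java.util.concurrent locks show up as WAITING or TIMED_WAITING, so a threshold on BLOCKED alone may miss that kind of contention.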
The example below uses Alibaba Cloud's SchedulerX service; the approach is the same with the Spring Boot scheduler.
@Component
@Slf4j
public class BlockedThreadMonitorJobProcessor extends JavaProcessor {
    // Default number of BLOCKED threads that triggers a dump
    private static final int DEFAULT_THRESHOLD = 30;

    @Override
    public ProcessResult process(JobContext context) throws Exception {
        // A numeric job parameter, when present, overrides the default threshold
        int thresholdNum = DEFAULT_THRESHOLD;
        String args = context.getJobParameters();
        try {
            if (StringUtils.isNotBlank(args)) {
                thresholdNum = Integer.parseInt(args.trim());
            }
        } catch (NumberFormatException e) {
            log.error("invalid threshold '{}', using {}", args, thresholdNum, e);
        }
        int num = JVMUtils.blockedTheadNum();
        if (num < thresholdNum) {
            return new ProcessResult(true);
        }
        // The file name has hour granularity, so at most one dump is written per hour
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd-HH");
        String date = format.format(new Date());
        File file = new File("/home/admin/logs/", "jstack-" + date + ".log");
        if (file.exists()) {
            log.info("BlockedThreadMonitor output ignore");
            return new ProcessResult(true);
        }
        log.info("BlockedThreadMonitor num:{}", num);
        // try-with-resources closes the file even if the dump fails
        try (FileOutputStream jstackStream = new FileOutputStream(file)) {
            JVMUtils.jstack(jstackStream);
        }
        return new ProcessResult(true);
    }
}
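For readers who have neither SchedulerX nor Spring at hand, the same check can be sketched with only the JDK. The version below is a stand-in under stated assumptions, not the setup used in this article: the class BlockedThreadWatcher and its methods are hypothetical names, and it uses ThreadMXBean.dumpAllThreads with ThreadInfo.toString() to stay short. In the JDKs I have looked at, toString() prints only the first few stack frames, which is presumably why JVMUtils formats the frames itself.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical JDK-only stand-in for the scheduled job.
final class BlockedThreadWatcher {
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH");

    /** Dump file for the current hour inside dir. */
    static Path dumpFile(Path dir) {
        return dir.resolve("jstack-" + LocalDateTime.now().format(HOUR) + ".log");
    }

    /** Writes a dump when at least `threshold` threads are BLOCKED; returns the file, or null if nothing was written. */
    static Path checkOnce(Path dir, int threshold) {
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        int blocked = 0;
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        Path file = dumpFile(dir);
        // Skip when below the threshold or when this hour already has a dump
        if (blocked < threshold || Files.exists(file)) {
            return null;
        }
        try (OutputStream out = Files.newOutputStream(file)) {
            for (ThreadInfo info : infos) {
                if (info != null) {
                    out.write(info.toString().getBytes(StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    /** Runs checkOnce every periodSeconds on a daemon thread. */
    static ScheduledExecutorService start(Path dir, int threshold, long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "blocked-thread-watcher");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleWithFixedDelay(() -> {
            try {
                checkOnce(dir, threshold);
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all later runs
                System.err.println("blocked-thread check failed: " + e);
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }
}
```

Calling BlockedThreadWatcher.start(Paths.get("/home/admin/logs"), 30, 60) once at application startup would roughly mirror the job above. Because the watcher lives inside the application, every new instance gets it automatically when scaling out.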
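Calling shell from Java
The core of this approach is the jstack tool. For example:
# Count the threads in the BLOCKED state (12345 is an example PID)
jstack 12345|grep BLOCKED|wc -l
# Save the full thread dump to a file
jstack 12345 > jstack.log
In Java, we can use ProcessBuilder to invoke the command.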
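@Component
@Slf4j
public class BlockedThreadMonitorJobProcessor extends JavaProcessor {
    @Override
    public ProcessResult process(JobContext context) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("jstack", getPid());
        // DateUtils.formatLongDate is a helper that is not shown in this article
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(
                new File("jstack-" + DateUtils.formatLongDate(new Date()) + ".log")));
        Process p = pb.start();
        // exitValue() throws IllegalThreadStateException while jstack is still running,
        // so wait for it to finish first
        log.info("process exit:{}", p.waitFor());
        return new ProcessResult(true);
    }

    private String getPid() {
        RuntimeMXBean bean = ManagementFactory.getRuntimeMXBean();
        // Get the name that represents the running Java virtual machine.
        // It returns something like 6460@AURORA; the value before
        // the @ sign is the PID (this format is implementation-specific).
        String jvmName = bean.getName();
        log.info("Name = {}", jvmName);
        // Extract the PID by splitting the string returned by
        // bean.getName().
        return jvmName.split("@")[0];
    }
}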
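The two fragile spots in this approach are finding the PID and waiting for the child process. The sketch below isolates both, assuming Java 9 or later and a jstack binary on the PATH. JstackRunner, currentPid, parsePid and runAndWait are hypothetical names introduced for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the article's code.
final class JstackRunner {

    /** PID of the current JVM via ProcessHandle (available since Java 9). */
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    /** Fallback for older JVMs: takes the part before '@' in a name such as "6460@AURORA". */
    static String parsePid(String jvmName) {
        int at = jvmName.indexOf('@');
        return at > 0 ? jvmName.substring(0, at) : jvmName;
    }

    /**
     * Runs the command, appends stdout and stderr to `output`, and waits for it to exit.
     * Returns the exit code, or -1 if the process was killed after the timeout.
     */
    static int runAndWait(File output, long timeoutSeconds, String... command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(output));
        Process p = pb.start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            return -1;
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        String pid = Long.toString(currentPid());
        // Assumes a `jstack` executable is on the PATH of this process
        int exit = runAndWait(new File("jstack-" + pid + ".log"), 30, "jstack", pid);
        System.out.println("jstack exit code: " + exit);
    }
}
```

The timeout is a design choice: a scheduled job that waits forever on a hung jstack would itself become one more stuck thread.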
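CPU usage too high
For example, to find the top N busiest threads when CPU usage reaches 80%, we can use arthas's thread -n command. However, arthas is interactive, so we need a little knowledge of named pipes, which is what mknod arthas_input p does in the script below.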
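#!/bin/bash
ctime=$(date "+%H-%M-%S")
# -b runs top in batch mode so it also works without a terminal (e.g. from cron).
# The second field of the CPU summary line is the user CPU percentage in common top versions.
cpu_percent=$(top -b -n 1 | grep Cpu | awk '{print int($2)}')
if [ "${cpu_percent}" -gt 80 ]; then
    echo "${ctime} start collecting the threads that use the most CPU"
    # Create a named pipe and keep it open on fd 8 so arthas reads its commands from it
    mknod arthas_input p
    exec 8<> arthas_input
    ./as.sh <&8 &
    # Answer the process-selection prompt of as.sh by picking the first Java process
    echo -e "1\n" >> arthas_input
    echo "thread -n 10 > $(pwd)/arthas.${cpu_percent}.${ctime}.result" >> arthas_input
    echo "quit" >> arthas_input
    # The pipe stays usable through fd 8 after its name is removed
    rm -f arthas_input
    sleep 2s
fi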
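If attaching arthas from a script is not an option, a rough in-process approximation of "top N busiest threads" can be built on ThreadMXBean.getThreadCpuTime by sampling twice and ranking threads by the difference. This is only a sketch of the idea and makes no claim about how arthas computes its numbers. BusyThreads and topN are hypothetical names, and the result is empty on JVMs where thread CPU time measurement is unsupported or disabled.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the article's code.
final class BusyThreads {

    /** Returns up to n entries "threadName cpuMs=X" for the threads that used the most CPU during sampleMillis. */
    static List<String> topN(int n, long sampleMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return new ArrayList<>();
        }
        // First sample: CPU nanoseconds per thread id (-1 means the thread is gone)
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }
        try {
            Thread.sleep(sampleMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Second sample: keep {threadId, cpuNanosUsedInWindow}
        List<long[]> deltas = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long after = mx.getThreadCpuTime(id);
            if (after < 0) {
                continue; // thread exited in the meantime
            }
            Long start = before.get(id);
            long base = (start == null || start < 0) ? 0 : start; // thread started during the window
            deltas.add(new long[] {id, after - base});
        }
        deltas.sort((a, b) -> Long.compare(b[1], a[1]));
        List<String> result = new ArrayList<>();
        for (long[] d : deltas) {
            if (result.size() >= n) {
                break;
            }
            ThreadInfo info = mx.getThreadInfo(d[0]);
            if (info != null) {
                result.add(info.getThreadName() + " cpuMs=" + d[1] / 1_000_000);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : topN(10, 1000)) {
            System.out.println(line);
        }
    }
}
```

Unlike the shell script, this only sees threads visible through the management API of the same JVM, so GC or JIT compiler activity may not appear in the ranking.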