Capturing a Java Thread Snapshot via Monitoring When Things Go Wrong

梦康 2022-11-24 22:49:56

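Production systems occasionally respond slowly (RT spikes) or see CPU usage surge. We are not always at the keyboard when that happens, so we need monitoring scripts that preserve the scene for us.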

Two cases follow: one solved with Java's own technology stack, the other with shell.

High RT

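This can be caused by excessive QPS, slow SQL, blocked RPC calls, and so on. We can deploy the monitor as a Spring Boot scheduled task, so that nobody has to remember to install a separate monitoring script when the service is scaled out.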

Pure Java code

It does take quite a lot of code, though.

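import java.io.OutputStream;
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;

public class JVMUtils {
    // Counts the threads currently in the BLOCKED state.
    public static int blockedTheadNum() {
        int count = 0;
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        long[] allThreadIds = threadMXBean.getAllThreadIds();
        // Only the thread state is needed here, so skip stack traces and lock details.
        ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(allThreadIds);
        for (ThreadInfo threadInfo : threadInfos) {
            // An entry is null when the thread exited after getAllThreadIds() returned.
            if (threadInfo != null && threadInfo.getThreadState() == Thread.State.BLOCKED) {
                count++;
            }
        }

        return count;
    }

    // Writes a jstack-like dump of all live threads to the given stream.
    public static void jstack(OutputStream stream) throws Exception {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        long[] allThreadIds = threadMXBean.getAllThreadIds();
        ThreadInfo[] threadInfos = threadMXBean.getThreadInfo(allThreadIds, true, true);
        for (ThreadInfo threadInfo : threadInfos) {
            if (threadInfo != null) {
                stream.write(printThreadInfoDepth(threadInfo, 1000).getBytes(StandardCharsets.UTF_8));
            }
        }
    }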

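    // Same layout as ThreadInfo.toString(), which prints only the first 8 frames; here the depth is configurable.
    static String printThreadInfoDepth(ThreadInfo threadInfo, int depth) {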
        StringBuilder sb = new StringBuilder("\"" + threadInfo.getThreadName() + "\"" +
                " Id=" + threadInfo.getThreadId() + " " +
                threadInfo.getThreadState());
        if (threadInfo.getLockName() != null) {
            sb.append(" on " + threadInfo.getLockName());
        }
        if (threadInfo.getLockOwnerName() != null) {
            sb.append(" owned by \"" + threadInfo.getLockOwnerName() +
                    "\" Id=" + threadInfo.getLockOwnerId());
        }
        if (threadInfo.isSuspended()) {
            sb.append(" (suspended)");
        }
        if (threadInfo.isInNative()) {
            sb.append(" (in native)");
        }
        sb.append('\n');
        int i = 0;
        for (; i < threadInfo.getStackTrace().length && i < depth; i++) {
            StackTraceElement ste = threadInfo.getStackTrace()[i];
            sb.append("\tat " + ste.toString());
            sb.append('\n');
            if (i == 0 && threadInfo.getLockInfo() != null) {
                Thread.State ts = threadInfo.getThreadState();
                switch (ts) {
                    case BLOCKED:
                        sb.append("\t-  blocked on " + threadInfo.getLockInfo());
                        sb.append('\n');
                        break;
                    case WAITING:
                        sb.append("\t-  waiting on " + threadInfo.getLockInfo());
                        sb.append('\n');
                        break;
                    case TIMED_WAITING:
                        sb.append("\t-  waiting on " + threadInfo.getLockInfo());
                        sb.append('\n');
                        break;
                    default:
                }
            }

            for (MonitorInfo mi : threadInfo.getLockedMonitors()) {
                if (mi.getLockedStackDepth() == i) {
                    sb.append("\t-  locked " + mi);
                    sb.append('\n');
                }
            }
        }
        if (i < threadInfo.getStackTrace().length) {
            sb.append("\t...");
            sb.append('\n');
        }

        LockInfo[] locks = threadInfo.getLockedSynchronizers();
        if (locks.length > 0) {
            sb.append("\n\tNumber of locked synchronizers = " + locks.length);
            sb.append('\n');
            for (LockInfo li : locks) {
                sb.append("\t- " + li);
                sb.append('\n');
            }
        }
        sb.append('\n');
        return sb.toString();
    }

}
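
To see what the blocked-thread count actually reacts to, here is a minimal, self-contained sketch that deliberately parks one thread on a monitor held by another and then counts BLOCKED threads through ThreadMXBean. The class name BlockedDemo and its methods are hypothetical and exist only for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class BlockedDemo {
    /** Counts threads currently in the BLOCKED state (state only, no stack traces). */
    static int countBlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int count = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            // Entries are null for threads that exited between the two calls.
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                count++;
            }
        }
        return count;
    }

    /** Holds a lock, starts a contender thread, and returns the blocked count seen meanwhile. */
    static int demo() {
        final Object lock = new Object();
        final CountDownLatch started = new CountDownLatch(1);
        Thread contender = new Thread(() -> {
            started.countDown();
            synchronized (lock) {
                // Acquired only after the caller leaves its synchronized block.
            }
        }, "contender");
        contender.setDaemon(true);
        int observed;
        try {
            synchronized (lock) {
                contender.start();
                started.await();
                // Poll until the contender is really parked on the monitor.
                while (contender.getState() != Thread.State.BLOCKED) {
                    Thread.sleep(10);
                }
                observed = countBlocked();
            }
            contender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the demo", e);
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("blocked threads observed: " + demo());
    }
}
```

The count is at least one while the lock is held; other JVM threads may add to it, which is why the monitor compares against a threshold instead of expecting an exact number.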

The example below uses Alibaba Cloud's SchedulerX service; the Spring Boot scheduler works the same way.

@Component
@Slf4j
public class BlockedThreadMonitorJobProcessor extends JavaProcessor {

    private int thresholdNum = 30;

    @Override
    public ProcessResult process(JobContext context) throws Exception {

        String args = context.getJobParameters();

        try {
            if (StringUtils.isNotBlank(args)) {
                thresholdNum = Integer.parseInt(args);
            }
        } catch (Exception e) {
            log.error(e.getMessage());
        }

        int num = JVMUtils.blockedTheadNum();

        if (num < thresholdNum) {
            return new ProcessResult(true);
        }

        // Hour-granularity file name, so at most one dump is written per hour.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd-HH");
        String date = format.format(new Date());

        File file = new File("/home/admin/logs/", "jstack-" + date + ".log");

        if (file.exists()){
            // A dump already exists for this hour; skip.
            log.info("BlockedThreadMonitor output ignore");
            return new ProcessResult(true);
        }

        log.info("BlockedThreadMonitor num:{}", num);

        // try-with-resources closes the file even if writing the dump fails
        try (FileOutputStream jstackStream = new FileOutputStream(file)) {
            JVMUtils.jstack(jstackStream);
        }

        return new ProcessResult(true);
    }
}
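
If SchedulerX is not available, the same check can be driven by any scheduler. The following is a minimal sketch that uses only the JDK's ScheduledExecutorService. The class name, the one-minute period, and the temp-directory output path are assumptions made for illustration, and the dump relies on ThreadInfo.toString(), which truncates deep stacks, instead of the deeper formatter in JVMUtils.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class BlockedThreadWatcher {
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH");

    /** Dump file for the given time; one name per hour means at most one dump per hour. */
    static Path dumpFile(Path dir, LocalDateTime now) {
        return dir.resolve("jstack-" + HOUR.format(now) + ".log");
    }

    /** Writes a dump if at least threshold threads are BLOCKED and this hour has no dump yet. */
    static boolean checkOnce(Path dir, int threshold, LocalDateTime now) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int blocked = 0;
        // Cheap pass first: thread states only, no stack traces.
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        Path file = dumpFile(dir, now);
        if (blocked < threshold || Files.exists(file)) {
            return false;
        }
        StringBuilder sb = new StringBuilder();
        // Expensive pass only when a dump is really needed.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info);
        }
        try {
            Files.write(file, sb.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    public static void main(String[] args) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        Path dir = Paths.get(System.getProperty("java.io.tmpdir")); // assumed output directory
        pool.scheduleAtFixedRate(() -> {
            try {
                checkOnce(dir, 30, LocalDateTime.now());
            } catch (RuntimeException e) {
                e.printStackTrace(); // an uncaught exception would cancel all future runs
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}
```

Passing the current time in as a parameter keeps the once-per-hour rule easy to test. The try/catch inside the task matters because scheduleAtFixedRate silently stops rescheduling a task that throws.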

Calling shell from Java

The core is the jstack tool. For example:

# Count the blocked threads
jstack 12345|grep BLOCKED|wc -l
# Save the full thread dump to a file
jstack 12345 > jstack.log

In Java, ProcessBuilder can be used to run the command:

@Component
@Slf4j
public class BlockedThreadMonitorJobProcessor extends JavaProcessor {

    @Override
    public ProcessResult process(JobContext context) throws Exception {
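        ProcessBuilder pb = new ProcessBuilder("jstack", getPid());

        // appendTo is ProcessBuilder.Redirect.appendTo (static import)
        pb.redirectOutput(appendTo(new File("jstack-"+ DateUtils.formatLongDate(new Date()) + ".log")));
        Process p = pb.start();
        // exitValue() throws IllegalThreadStateException while the process is still running, so wait for it first
        log.info("process exit:{}", p.waitFor());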

        return new ProcessResult(true);
    }

    private String getPid(){
        RuntimeMXBean bean = ManagementFactory.getRuntimeMXBean();

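        // Get the name of the running Java virtual machine.
        // It returns something like 6460@AURORA, where the value
        // before the @ sign is the PID.
        String jvmName = bean.getName();
        log.info("Name = {}", jvmName);

        // Extract the PID by splitting the string returned by
        // bean.getName().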
        return jvmName.split("@")[0];
    }

}
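
As a variation on the class above, this sketch waits for the child process with a timeout and reads the PID from ProcessHandle, which needs Java 9 or later. The class and method names are hypothetical, and it assumes the jstack binary is on the PATH of the running service.

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

class JstackRunner {
    /** PID of the current JVM (Java 9+), without parsing the "pid@host" runtime name. */
    static String currentPid() {
        return Long.toString(ProcessHandle.current().pid());
    }

    /**
     * Runs a command, appends its stdout and stderr to out, and waits up to
     * timeoutSeconds. Returns the exit code, or -1 on timeout or failure to start.
     */
    static int runToFile(File out, long timeoutSeconds, String... command) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // merge stderr into stdout
            pb.redirectOutput(ProcessBuilder.Redirect.appendTo(out));
            Process p = pb.start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // do not leave a hung child process behind
                return -1;
            }
            return p.exitValue(); // safe here: the process has already exited
        } catch (IOException e) {
            return -1; // for example, the binary is not on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        File out = new File("jstack-" + System.currentTimeMillis() + ".log");
        int exit = runToFile(out, 30, "jstack", currentPid());
        System.out.println("jstack exit code: " + exit);
    }
}
```

The timeout is a defensive choice: when the JVM is already struggling, a scheduled job that blocks forever on a child process only adds to the problem.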

High CPU usage

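For example, to find the top n busiest threads once CPU usage reaches 80%, you can use Arthas's thread -n command. Arthas is interactive, though, so we need a little knowledge of pipes, namely the mknod arthas_input p shown below.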

An earlier post on my WeChat official account: https://mp.weixin.qq.com/s/OZdmTn3emTdFjlGY2-2fDg

#!/bin/bash

ctime=$(date "+%H-%M-%S")
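# -b runs top in batch mode, so the output is plain text that can be piped (also needed under cron)
cpu_percent=$(top -b -n 1|grep Cpu|awk -F " " '{print int($2)}')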

if [ ${cpu_percent} -gt 80 ] ; then
  echo "${ctime} start collecting the threads using the most CPU"
  mknod arthas_input p
  exec 8<> arthas_input
  ./as.sh <&8 &
  echo -e "1\n" >> arthas_input

  echo "thread -n 10 > $(pwd)/arthas.${cpu_percent}.${ctime}.result" >> arthas_input
  echo "quit" >> arthas_input
  rm -f arthas_input
  sleep 2s
fi
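
For comparison with the shell approach, the idea of ranking threads by CPU time over a short window can be approximated with the JDK's own ThreadMXBean. This is a rough sketch and not a description of how Arthas is implemented. The class name, the sampling window, and the output format are assumptions, and it only sees threads inside the JVM it runs in.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BusyThreads {
    /** Returns up to n entries "name cpuMs", ordered by CPU time used during the window. */
    static List<String> topN(int n, long windowMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> result = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return result; // per-thread CPU time is not available on this JVM
        }
        // First sample: CPU nanoseconds per thread id (-1 if the thread is gone).
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }
        try {
            Thread.sleep(windowMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return result;
        }
        // Second sample: threads started inside the window count from zero.
        List<long[]> deltas = new ArrayList<>(); // each entry is {threadId, cpuNanos}
        for (long id : mx.getAllThreadIds()) {
            long start = before.getOrDefault(id, 0L);
            long end = mx.getThreadCpuTime(id);
            if (start >= 0 && end >= 0) {
                deltas.add(new long[] {id, end - start});
            }
        }
        deltas.sort((a, b) -> Long.compare(b[1], a[1])); // busiest first
        for (long[] d : deltas) {
            if (result.size() >= n) {
                break;
            }
            ThreadInfo info = mx.getThreadInfo(d[0]);
            if (info != null) {
                result.add(info.getThreadName() + " " + (d[1] / 1_000_000) + "ms");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : topN(10, 1000)) {
            System.out.println(line);
        }
    }
}
```

Unlike the script, this runs inside the application, so it could be called from the same scheduled job as the blocked-thread check once a CPU threshold is crossed.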