May 15, 2023

Celestia. Evaluating kernel panic caused by ?

In this article I am going to show how to track a kernel panic caused by some process on a server. During the blockrace testnet many people faced this error but didn't know how to get the logs and evaluate them. Let's find out why there is nothing in the server logs even though we are sure the issue took place.

First of all, what is a kernel panic? What definition could OpenAI give us?

Kernel Panic is a critical error that occurs in operating systems, such as Linux, macOS, and Unix-based systems, when the operating system detects an internal error from which it cannot recover.

It is a type of error message that occurs when the kernel, which is the core part of the operating system, encounters a fatal error that prevents it from continuing normal operations. The system typically displays a message on the screen indicating that a kernel panic has occurred and then shuts down.

Kernel panic can occur due to a variety of reasons, including hardware failure, software bugs, driver issues, or a corrupt filesystem. It is a serious error that can lead to data loss and system instability, and it often requires a reboot or a complete reinstallation of the operating system to fix the problem.

The definition is pretty much correct, but let me add a little.

When a kernel panic happens, there is usually no way for it to be saved in any of the regular Ubuntu logs.

The reason is simple. During this error all processes are stuck. The drives are dead, the CPU and RAM are dead, and even TCP networking is dead. We can't write anything on our machine, so we won't be able to track down the reason locally. The only way out is to try to export the kernel messages over UDP. So how do we handle it?

Kernel netconsole will help us.

Kernel netconsole is a feature in the Linux kernel that allows you to log kernel messages over a network connection. It can be useful for debugging kernel issues and diagnosing system crashes, especially in situations where physical access to the system is not possible.

When enabled, kernel netconsole sends kernel messages to a remote host over the network, allowing you to view the messages in real time on another machine. The messages are sent as UDP packets.

To set up kernel netconsole, you first need to configure the kernel to support netconsole and specify the IP address and port number of the remote host where you want to send the messages. You also need to configure the remote host to receive the netconsole messages.

Once configured, the kernel netconsole can be used to troubleshoot kernel issues by providing detailed information about the kernel's behavior and error messages that occur during system operation. This can help you to identify the root cause of a problem and take appropriate action to resolve it.
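For reference, the netconsole kernel module takes its whole configuration as one `netconsole=` parameter of the form `[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]`. The sketch below only assembles that string. The helper name `build_netconsole_param` and every address in the example are made up for illustration and are not taken from the setup described in this article.

```shell
#!/bin/sh
# Hypothetical helper: assemble the value for the kernel's netconsole= parameter.
# Format (kernel netconsole documentation):
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# Arguments: src_port src_ip dev tgt_port tgt_ip tgt_mac
build_netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example with placeholder addresses (loading the module needs root):
#   sudo modprobe netconsole \
#     netconsole="$(build_netconsole_param 6665 192.0.2.10 eth0 9999 192.0.2.20 00:11:22:33:44:55)"
```

Keeping the string assembly in one place makes it easier to spot a swapped port or address before the module is loaded.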

We are going to need one extra machine.

On the machine that is going to receive our logs, we need to listen on a UDP port:

netcat -u -l -p 9999

It is recommended to run it inside screen and in a loop:

while true
do
    echo -n 'start > '
    netcat -u -l -p 9999
    echo -n 'stop < '
    date
done
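Since the panic may arrive hours later, it can be handy to keep the received messages in a file with a timestamp on each line. This is a minimal sketch, assuming the same netcat listener as above; the helper name `stamp_lines` and the log file name are hypothetical, and the timestamp is the time the receiver read the line, not the time the kernel printed it.

```shell
#!/bin/sh
# Hypothetical helper: prefix every line read from stdin with a UTC timestamp.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$line"
    done
}

# Example on the receiving machine (log file name is a placeholder):
#   netcat -u -l -p 9999 | stamp_lines >> netconsole.log
```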

On the main server we need to install netconsole:

sudo apt install netconsole

Prepare netconsole:

netconsole-setup [email protected]

This command will send all kernel events to port 9999 on the server loghist2.net. It works out all IP addresses, routes and MAC addresses without user interaction and, after setup, sends a test message that we can find on the first server.

Server we want to get logs from
Server with our kernel logs

As you can see, all addresses were calculated without our help. After that there will be an entry visible with ls /sys/kernel/config/netconsole/
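To double-check what was configured, the target attributes can be read straight from that configfs directory. The sketch below assumes the setup created at least one target directory there; the helper name `show_netconsole_targets` is hypothetical, while the attribute file names (`enabled`, `dev_name`, `local_ip`, `remote_ip`, `remote_port`, `remote_mac`) are the ones listed in the kernel's netconsole documentation.

```shell
#!/bin/sh
# Hypothetical helper: list netconsole targets found under a configfs directory.
# $1 is the base directory; it defaults to the path mentioned in the article.
show_netconsole_targets() {
    base=${1:-/sys/kernel/config/netconsole}
    for target in "$base"/*/; do
        # Skip the unexpanded pattern when no target directory exists.
        [ -d "$target" ] || continue
        name=$(basename "$target")
        for attr in enabled dev_name local_ip remote_ip remote_port remote_mac; do
            if [ -r "$target$attr" ]; then
                printf '%s %s=%s\n' "$name" "$attr" "$(cat "$target$attr")"
            fi
        done
    done
}

# Example (reading configfs may require root):
#   show_netconsole_targets
```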

What did we get after the Celestia crash?

We managed to get the following kernel logs:

Call Trace:
<IRQ>
? __mkroute_output+0x188/0x530
icmp_glue_bits+0x2a/0xa0
__ip_append_data+0x9ab/0xe40
? icmp_push_reply+0x130/0x130
? icmp_push_reply+0x130/0x130
ip_append_data+0x7b/0xe0
icmp_push_reply+0x55/0x130
__icmp_send+0x533/0x7b0
? nf_ct_get_tuple+0x14c/0x1f0 [nf_conntrack]
? __cgroup_bpf_run_filter_skb+0x45c/0x480
ip_fragment.constprop.0+0x7e/0x90
? ip_fragment.constprop.0+0x7e/0x90
__ip_finish_output+0xa4/0x180
ip_finish_output+0x2e/0xc0
ip_output+0x78/0x100
? __ip_finish_output+0x180/0x180
ip_local_out+0x5e/0x70
__ip_queue_xmit+0x184/0x440
? tcp_syn_options+0x1f9/0x300
ip_queue_xmit+0x15/0x20
__tcp_transmit_skb+0x910/0x9c0
__tcp_retransmit_skb+0x197/0x540
? tcp_out_of_resources+0x3f/0xe0
? tcp_write_timeout+0x36e/0x4c0
tcp_retransmit_skb+0x19/0xd0
tcp_retransmit_timer+0x35f/0x640
tcp_write_timer_handler+0xd5/0x100
tcp_write_timer+0xa2/0xf0
? tcp_write_timer_handler+0x100/0x100
call_timer_fn+0x2c/0x120
__run_timers.part.0+0x1e3/0x270
? ktime_get+0x46/0xc0
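In a trace like this, entries starting with "?" are addresses the kernel found on the stack but could not confirm as part of the call chain, so it often helps to look at the remaining frames first. The filter below is a rough sketch for that; the helper name `reliable_frames` is hypothetical and it assumes the trace was saved as plain text, one frame per line.

```shell
#!/bin/sh
# Hypothetical helper: reduce a kernel call trace (stdin) to its reliable frames.
reliable_frames() {
    # Drop blank lines, the "Call Trace:" header, context markers such as <IRQ>,
    # and "?" entries (optionally preceded by a stray "["), then strip the
    # +offset/size suffix together with any trailing [module] tag.
    grep -Ev -e '^[[:space:]]*$' \
             -e '^[[:space:]]*Call Trace:' \
             -e '^[[:space:]]*<' \
             -e '^[[:space:]]*\[?\?' \
        | sed -e 's/^[[:space:]]*//' -e 's|+0x[0-9a-f]*/0x[0-9a-f]*.*$||'
}

# Example (file name is a placeholder):
#   reliable_frames < trace.txt
```

The most recent call is at the top of the output, so reading it from the bottom up gives the order in which the functions were entered.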

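As we can see, this issue is related to packet flooding. Reading the trace from the bottom up, a TCP retransmit timer fires, the retransmitted packet goes through IP fragmentation, and the kernel then tries to send an ICMP reply. Our system can't handle the requests and crashes. This could be related to the kernel version, to Celestia, or to both.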

What can we do to try to fix it without the help of the team? TBH, not much.

All we can do is update our kernel from 5.15.0-69 to 5.15.0-70.

To check your version, type uname -a

After that, upgrade to the latest version:

apt --only-upgrade install linux-image-generic
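The version check can also be scripted, for example to confirm after a reboot that the machine is really running the newer kernel. This is a minimal sketch, assuming GNU `sort -V` is available (it is on Ubuntu); the helper name `kernel_at_least` is hypothetical, and 5.15.0-70 is simply the release named above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release $2 is at least release $1.
# Only the numeric "X.Y.Z-N" prefix is compared; suffixes like "-generic" are ignored.
kernel_at_least() {
    min=$(printf '%s\n' "$1" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/')
    cur=$(printf '%s\n' "$2" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+-[0-9]+).*/\1/')
    # sort -V orders version strings; the minimum must sort first (or be equal).
    lowest=$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Compare the running kernel against the release named in the article.
if kernel_at_least 5.15.0-70 "$(uname -r)"; then
    echo "running kernel is 5.15.0-70 or newer"
else
    echo "running kernel is older than 5.15.0-70"
fi
```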

P.S. This issue seems to be fixed in the newest Celestia versions, and the kernel was updated as well. I hope this small guide helps someone learn how to handle kernel panics.