+++
title = "Mainline Hero Part 1 - First Attempts At Porting"
date = "2019-08-21"
template = "post.html"
aliases = [ "/Mainline-Hero-1.html" ]
+++
In the first post of the series, I showed what information I gathered and what tricks can be used
to debug our mainline port of the *herolte* kernel. While I learned a lot just by preparing for
the actual porting, I did not even get close to booting the kernel. I would have
liked to write about what I did to *actually* boot a *5.X.X* kernel on the device, but instead I will tell you
about the journey I have completed thus far.
<!-- more -->
If you are curious about the progress I made, you can find the patches [here]({{ site.social.git_url}}/herolte-mainline). The first patches I produced are in the `patches/` directory, while the ones I created with lower
expectations are in the `patches_v2/` directory. Both "patchsets" are based on the `linux-next` source.
## Starting Out
My initial expectations about mainlining were simple: *The kernel should at least boot and then perhaps
crash in some way I can debug*.
This, however, was my first mistake: Nothing is that easy! Ignoring this, I immediately began writing
up a *Device Tree* based on the original downstream source. This was the first big challenge as the sheer amount of
downstream *Device Tree* source is overwhelming:
```
$ wc -l exynos* | awk -F' ' '{print $1}' | awk '{sum += $1} END {print sum}'
54952
```
But I chewed through most of them by just looking for interesting nodes like `cpu` or `memory`, after which
I transferred them into a new, simple *Device Tree*. At this point I learned that the *GitHub* search does not
work as well as I thought it did. It **does** find what I searched for. But only sometimes. So how do we find
what we are looking for? By *grep*-ping through the files. Using `grep -i -r cpu .` we are able to search
a directory tree for the keyword `cpu`. But while *grep* does a wonderful job, it is kind of slow. So at that
point I switched over to a tool called `ripgrep`, which does these searches a lot faster than plain old grep.
At some point, I found it very tiring to search for nodes, the reason being that I had to search for specific
nodes without knowing their names or locations. This led to the creation of a script which parses a *Device Tree*
while following includes of other *Device Tree* files, allowing me to search for nodes which have, for example, a
certain attribute set. This script is also included in the "patch repository"; however, it does not work perfectly:
it finds most of the nodes, but not all of them. Still, it was sufficient for my searches.
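
The idea behind such a search script can be sketched in a few lines of Python. This is not the script from the patch repository, just a minimal illustration under some loud assumptions: every node opening, property and closing brace sits on its own line, only quoted includes (`#include "..."` or `/include/ "..."`) are followed, and they are resolved relative to the including file. The function names are made up for this example.

```python
import re
from pathlib import Path

# Matches both C-preprocessor style and native dts include directives.
INCLUDE_RE = re.compile(r'^\s*(?:#include|/include/)\s+"([^"]+)"')
# Matches the start of a node such as `cpu0: cpu@0 {` or `memory {`.
NODE_RE = re.compile(r'^\s*(?:[\w-]+:\s*)?([\w,.+@/-]+)\s*\{')
# Matches a line that closes a node.
CLOSE_RE = re.compile(r'^\s*\}\s*;')


def read_with_includes(path, seen=None):
    """Yield the lines of a Device Tree file, following includes recursively."""
    seen = set() if seen is None else seen
    path = Path(path).resolve()
    if path in seen:  # guard against include cycles
        return
    seen.add(path)
    for line in path.read_text().splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            yield from read_with_includes(path.parent / match.group(1), seen)
        else:
            yield line


def find_nodes_with_property(path, prop):
    """Return the paths of all nodes that set the property `prop`."""
    stack, found = [], []
    prop_re = re.compile(r'^\s*' + re.escape(prop) + r'\s*(=|;)')
    for line in read_with_includes(path):
        node = NODE_RE.match(line)
        if node:
            stack.append(node.group(1))
        elif prop_re.match(line) and stack:
            # The root node is named "/", so collapse the doubled slash.
            found.append("/".join(stack).replace("//", "/"))
        if CLOSE_RE.match(line) and stack:
            stack.pop()
    return found
```

Calling `find_nodes_with_property("exynos8890.dts", "device_type")` would then list every node that sets `device_type`, including those pulled in from included files. The file name here is only a placeholder. A line-based approach like this also shares the weakness mentioned above: it will miss nodes that are formatted in unusual ways.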
After finally having the basic nodes in my *Device Tree*, I started to port over all of the required nodes
to enable the serial interface on the SoC. This was the next big mistake I made: I tried to do too much
without verifying that the kernel even boots. This was also the point where I learned that the *Device Tree*
by itself doesn't really do anything. It just tells the kernel what the SoC looks like so that the correct
drivers can be loaded and initialized. So I knew that I had to port drivers from the downstream kernel into the
mainline kernel. The kernel finds the driver responsible for a node by looking at the match data that each
driver exposes.
```
[...]
static struct of_device_id ext_clk_match[] __initdata = {
{ .compatible = "samsung,exynos8890-oscclk", .data = (void *)0, },
};
[...]
```
This is an example from the [clock driver](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/drivers/clk/samsung/clk-exynos8890.c#L122) of the downstream kernel.
When the kernel is processing a node of the *Device Tree* it looks for a driver that exposes the same
compatible attribute. In this case, it would be the *Samsung* clock driver.
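
As a rough mental model, this matching can be written down in Python. The sketch below is a simplification and not the kernel's actual algorithm: it only assumes that a node lists its `compatible` strings from most to least specific and that each driver exposes a list of strings it claims. The `Driver` class and the driver names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    name: str
    # Plays the role of an `of_device_id` table: the strings the driver claims.
    compatibles: list = field(default_factory=list)


def match_driver(node_compatibles, drivers):
    """Return the first driver claiming one of the node's compatible strings.

    The node's list is walked in order, so a more specific string wins
    over a generic fallback.
    """
    for compatible in node_compatibles:
        for driver in drivers:
            if compatible in driver.compatibles:
                return driver
    return None


# Hypothetical driver table for illustration.
drivers = [
    Driver("generic-fixed-clock", ["fixed-clock"]),
    Driver("clk-exynos8890", ["samsung,exynos8890-oscclk"]),
]
matched = match_driver(["samsung,exynos8890-oscclk", "fixed-clock"], drivers)
print(matched.name)
```

If no driver claims any of the strings, nothing is bound to the node, which is why a *Device Tree* without the matching driver code does nothing on its own.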
So at this point I was wildly copying over driver code into the mainline kernel. As I forgot this during the
porting attempt, I am mentioning my mistake again: I never thought about the possibility that the kernel would
not boot at all.
After having "ported" the driver code for the clock and some other devices I decided to try and boot the
kernel. With my phone plugged into the serial adapter, my terminal showed nothing. So I went into the
*S-Boot* console to poke around. There I tried some commands in the hope that the bootloader would initialize
the hardware for me so that the kernel would magically boot and give me serial output. One was especially
interesting at that time: The name made it look like it would test whether the processor can do **SMP** -
**S**ymmetric **M**ulti**p**rocessing, i.e. running one system on several identical cores that share memory.
This is not to be confused with *Intel*'s *Hyper-Threading* or *AMD*'s *SMT*, which run several threads on a single core.
By continuing to boot, I got some output via the serial interface! It was garbage data, but it was data. This
gave me some hope. However, it was just some data that was pushed by something other than the kernel. I checked
this hypothesis by installing the downstream kernel, issuing the same commands and booting the kernel.
## Back To The Drawing Board
At this point I was kind of frustrated. I knew that this endeavour was going to be difficult, but I immensely
underestimated it.
After taking a break, I went back to my computer with a new tactic: Port as few things as possible, confirm that
it boots and then port the rest. This was inspired by the way the *Galaxy Nexus* was mainlined in
[this](https://postmarketos.org/blog/2019/06/23/two-years/) blog post.
What did I do this time? The first step was a minimal *Device Tree*. No clock nodes. No serial nodes. No
GPIO nodes. Just the CPU, the memory and a *chosen* node. I set the `CONFIG_PANIC_TIMEOUT`
[option](https://cateee.net/lkddb/web-lkddb/PANIC_TIMEOUT.html) to 5, which makes the kernel reboot 5 seconds
after a panic. After waiting at least 15 seconds and seeing no reboot, I thought that the phone did boot the
mainline kernel. But before getting too excited, as I kept in mind that it was a hugely difficult endeavour,
I asked in *postmarketOS*' mainline Matrix channel whether it could happen that the phone panics and still does not reboot. The answer I got
was that it could, indeed, happen. It seems like the CPU does not know how to shut itself off. On the x86 platform, this
is the task of *ACPI*, while on *ARM* [*PSCI*](https://linux-sunxi.org/PSCI), the **P**ower **S**tate
**C**oordination **I**nterface, is responsible for it. Since the mainline kernel knows about *PSCI*, I wondered
why my phone did not reboot. After some thinking, I came up with 3 possibilities:
1. The kernel boots just fine and does not panic. Hence no reboot.
2. The kernel panics and wants to reboot but the *PSCI* implementation in the downstream kernel differs from the mainline code.
3. The kernel just does not boot.

The first possibility I threw out of the window immediately. It was just too easy. As such, I began
investigating the *PSCI* code. Out of curiosity, I looked at the implementation of the `emergency_restart`
function of the kernel and discovered that the function `arm_pm_restart` is used on *arm64*. Looking deeper, I
found out that this function is only set when the *Device Tree* contains a *PSCI* node of a supported version.
The downstream node is compatible with version `0.1`, which does not support the `SYSTEM_RESET` functionality
of *PSCI*. Since I could just turn off or restart the phone when using *Android* or *postmarketOS*, I knew
that there had to be something that works around the old firmware.
The downstream [*PSCI* node](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/arch/arm64/boot/dts/exynos8890.dtsi#L316) just specifies that it is compatible with `arm,psci`, so
how do I know that it is only firmware version `0.1` and how do I know of this `SYSTEM_RESET`?
If we grep for the compatible attribute `arm,psci` we find it as the value of the `compatible` field in the
source file `arch/arm64/kernel/psci.c`. It [specifies](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/arch/arm64/kernel/psci.c#L381) that exactly the value `arm,psci`
results in a call to the function `psci_0_1_init`. This indicates a version of *PSCI*. If we take a look
at *ARM*'s [*PSCI* documentation](http://infocenter.arm.com/help/topic/com.arm.doc.den0022d/Power_State_Coordination_Interface_PDD_v1_1_DEN0022D.pdf)
we find a section called *"Changes in PSCIv0.2 from first proposal"* which contains the information that,
compared to the first proposal (version 0.1), version 0.2 added the call `SYSTEM_RESET`. Hence we can guess that the *Exynos8890* SoC
comes with firmware which only supports version 0.1 of *PSCI*.
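
The reasoning above boils down to a small lookup, sketched here in Python. The three compatible strings are the ones defined by the *PSCI* Device Tree binding; the function names are made up for this example, and picking the highest announced version is a simplification of what the kernel does.

```python
# Compatible strings from the PSCI Device Tree binding and the
# version each one announces.
PSCI_COMPATIBLE_VERSION = {
    "arm,psci": (0, 1),
    "arm,psci-0.2": (0, 2),
    "arm,psci-1.0": (1, 0),
}

# SYSTEM_RESET is not part of the first proposal; it arrived with 0.2.
SYSTEM_RESET_MIN_VERSION = (0, 2)


def psci_version(compatibles):
    """Return the highest PSCI version announced by a node, or None."""
    versions = [PSCI_COMPATIBLE_VERSION[c] for c in compatibles
                if c in PSCI_COMPATIBLE_VERSION]
    return max(versions, default=None)


def supports_system_reset(compatibles):
    """Tell whether a node with these compatibles can offer SYSTEM_RESET."""
    version = psci_version(compatibles)
    return version is not None and version >= SYSTEM_RESET_MIN_VERSION


# The downstream node only says `arm,psci`, i.e. version 0.1.
print(supports_system_reset(["arm,psci"]))
```

For the downstream node this yields `False`, which matches the observation that a reset through *PSCI* is not available on this firmware.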
After a lot of searching, I found a node called `reboot` in the [downstream source](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/arch/arm64/boot/dts/exynos8890.dtsi#L116).
The compatible driver for it is within the [*Samsung* SoC](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/drivers/soc/samsung/exynos-reboot.c) driver code.
Effectively, this code reboots the SoC by mapping the address of the PMU, which I guess stands for
*Power Management Unit*, into memory and writing some value
to it. This value is probably the command which tells the PMU to reset the SoC.
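
To make the "map a register block and write a command into it" pattern more tangible, here is a small user-space simulation in Python. Everything in it is hypothetical: the register block is just a `bytearray`, and the size, offset and command value are placeholders that were not taken from the downstream driver. In the kernel, the same steps would be done on memory-mapped I/O with the kernel's own accessors.

```python
# Hypothetical values for illustration only; the real ones live in the
# downstream driver and the SoC's Device Tree.
PMU_SIZE = 0x1000
SWRESET_OFFSET = 0x400
SWRESET_COMMAND = 0x1


def write_reg32(region, offset, value):
    """Store a 32-bit little-endian value at `offset` in the register block."""
    region[offset:offset + 4] = value.to_bytes(4, "little")


def read_reg32(region, offset):
    """Load a 32-bit little-endian value from `offset` in the register block."""
    return int.from_bytes(region[offset:offset + 4], "little")


def request_reset(pmu):
    """Write the reset command into the (simulated) PMU register block."""
    write_reg32(pmu, SWRESET_OFFSET, SWRESET_COMMAND)


pmu = bytearray(PMU_SIZE)  # stand-in for the mapped PMU registers
request_reset(pmu)
```

On real hardware the write itself is the whole trick: once the PMU sees the command in its register, it resets the SoC, and no code after the write is expected to run.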
In my "patchset" *patches_v2* I have ported this code. Testing it with the downstream kernel, it
made the device do something. Although it crashed the kernel, it was enough to debug.
To test the mainline kernel, I added an `emergency_restart` at the beginning of the `start_kernel` function.
The result was that the device did not do anything. The only option I had left was 3; the kernel does not even
boot.
At this point I began investigating the `arch/arm64/` code of the downstream kernel more closely. However, I
noticed something unrelated during a kernel build: The downstream kernel logs something with *FIPS* at the
end of the build. Grepping for it led me to some code at [the end](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/scripts/link-vmlinux.sh#L253) of the `link-vmlinux.sh` script. I thought
that it was signing the kernel with a key in the repo, but it probably is doing something else. I tested
whether the downstream kernel boots without these crypto scripts and it did.
The only thing I did not test was whether the kernel boots without
["double-checking [the] jopp magic"](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/scripts/link-vmlinux.sh#L270). But by looking at this script, I noticed another interesting thing:
`CONFIG_RELOCATABLE_KERNEL`. Having just a rough idea of what this config option enables, I removed it
from the downstream kernel and tried to boot. But the kernel did not boot. This meant that this option
was required for booting the kernel. This is the only success I can report.
By grepping for this config option I found the file `arch/arm64/kernel/head.S`. I did not know what it was
for so I searched the internet and found a [thread](https://unix.stackexchange.com/questions/139297/what-are-the-two-head-s-files-in-linux-source)
on the *Unix & Linux Stack Exchange* that explained that the file
is prepended onto the kernel and executed before `start_kernel`. I mainly investigated this file, but in
hindsight I should have also looked more at the other occurrences of the `CONFIG_RELOCATABLE_KERNEL` option.
So what I did was try and port over code from the downstream `head.S` into the mainline `head.S`. This is
the point where I am at now. I did not progress any further as I am not used to assembly code, let alone *ARM*
assembly, but I still came up with some more hypotheses as to why the kernel does not boot:
1. For some reason the CPU never reaches the instruction to jump to `start_kernel`.
2. The CPU fails to initialize the MMU or some other low-level component and thus cannot jump into `start_kernel`.

At the moment, option 2 seems the most likely as the code of the downstream kernel and the mainline kernel
differs somewhat and I expect that *Samsung* added some code because their MMU might have some quirks that the
mainline kernel does not address. However, I have not had the chance to either confirm or refute any of these
assumptions.
As a bottom line, I can say that the most useful, but in my case most ignored, thing I learned is patience.
During the entire porting process I tried to do as much as I could in the shortest amount of time possible.
However, I quickly realized that I got the best ideas when I was doing something completely different. As
such, I also learned that it is incredibly useful to always have a piece of paper or a text editor handy
to write down any ideas you might have. You never know what might be useful and what might not.
I also want to mention that I used the [*Bootlin Elixir Cross Referencer*](https://elixir.bootlin.com/linux/latest/source)
a lot. It is a very useful tool when exploring the kernel source tree. However, I would still
recommend keeping a local copy so that you can very easily grep through the code and find things that
neither *GitHub* nor *Elixir* can find.