<!-- title: Mainline Hero Part 1 - First Attempts At Porting -->

In the first post of the series, I showed what information I gathered and what tricks can be used
to debug our mainline port of the *herolte* kernel. While I learned a lot just by preparing for
the actual porting, I was not able to get as far as booting the kernel. I would have
liked to write about what I did to *actually* boot a *5.X.X* kernel on the device, but instead I will tell you
about the journey I have completed thus far.

If you are curious about the progress I made, you can find the patches [here]({{ site.social.git_url}}/herolte-mainline). The first patches I produced are in the `patches/` directory, while the ones I created with lower
expectations are in the `patches_v2/` directory. Both "patchsets" are based on the `linux-next` source.

## Starting Out
My initial expectations about mainlining were simple: *The kernel should at least boot and then perhaps
crash in some way I can debug*.

This, however, was my first mistake: Nothing is that easy! Ignoring this, I immediately began writing
up a *Device Tree* based on the original downstream source. This was the first big challenge as the sheer volume of
downstream *Device Tree* files is overwhelming:
```
$ wc -l exynos* | awk -F' ' '{print $1}' | awk '{sum += $1} END {print sum}'
54952
```

Strictly speaking, that number is inflated: `wc -l` also prints a `total` line when it is given several files, so
the pipeline above counts every line twice. Even at roughly half that, it is a lot to read. Either way, I chewed
through most of the files by just looking for interesting nodes like `cpu` or `memory`, after which
I transferred them into a new simple *Device Tree*. At this point I learned that the *GitHub* search does not
work as well as I thought it did. It **does** find what I searched for, but only sometimes. So how do we find
what we are looking for? By *grep*-ping through the files. Using `grep -i -r cpu .` we are able to search
a directory tree for the keyword `cpu`. But while *grep* does a wonderful job, it is kind of slow. So at that
point I switched over to a tool called `ripgrep`, which does these searches a lot faster than plain old grep.

At some point, I found it very tiring to search for nodes, because I had to look for specific
nodes without knowing their names or locations. This led to the creation of a script which parses a *Device Tree*
while following includes of other *Device Tree* files, allowing me to search for nodes which have, for example, a
certain attribute set. This script is also included in the "patch repository"; however, it does not work perfectly.
It finds most of the nodes, but not all of them, which was sufficient for my searches.
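To give an idea of what such a search can look like, here is a minimal Python sketch. It is not the script from the patch repository: the function names, the regular expressions and the assumption that every node header, property and closing `};` sits on its own line are mine. It only follows quoted `#include "..."` and `/include/ "..."` directives and silently skips files it cannot find, so it will miss some nodes as well.

```python
import re
from pathlib import Path

# Quoted includes only; `#include <...>` headers are deliberately ignored.
INCLUDE_RE = re.compile(r'^\s*(?:#include|/include/)\s+"([^"]+)"')
# A node header such as `cpu0: cpu@0 {`, `&serial_0 {` or `/ {`.
NODE_OPEN_RE = re.compile(r'^\s*(?:[\w-]+:\s*)?([\w,.+@/&-]+)\s*\{\s*$')


def collect_sources(path, seen=None):
    """Return `path` plus every file it includes, depth-first, without repeats."""
    seen = [] if seen is None else seen
    path = Path(path).resolve()
    if path in seen or not path.is_file():
        return seen
    seen.append(path)
    for line in path.read_text(errors="replace").splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            # Includes are resolved relative to the including file.
            collect_sources(path.parent / match.group(1), seen)
    return seen


def find_nodes_with_property(root, prop):
    """Yield (file name, node path) for every node that sets `prop` directly."""
    prop_re = re.compile(r'^\s*' + re.escape(prop) + r'\s*(=|;)')
    for source in collect_sources(root):
        stack = []  # names of the nodes we are currently nested in
        for line in source.read_text(errors="replace").splitlines():
            opened = NODE_OPEN_RE.match(line)
            if opened:
                stack.append(opened.group(1))
            elif line.strip().startswith("};"):
                if stack:
                    stack.pop()
            elif stack and prop_re.match(line):
                yield source.name, "/".join(stack).replace("//", "/")
```

Calling `list(find_nodes_with_property("exynos8890.dts", "device_type"))`, with whatever top-level file you start from, returns the file and node path of every node that sets `device_type`, which is usually enough to jump to the right place with an editor.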

After finally having the basic nodes in my *Device Tree*, I started to port over all of the nodes required
to enable the serial interface on the SoC. This was the next big mistake I made: I tried to do too much
without verifying that the kernel even boots. This was also the point where I learned that the *Device Tree*
by itself doesn't really do anything. It just tells the kernel what the SoC looks like so that the correct
drivers can be loaded and initialized. So I knew that I had to port drivers from the downstream kernel into the
mainline kernel. The kernel identifies the corresponding driver by looking at the data that the drivers
expose.

```
[...]
static struct of_device_id ext_clk_match[] __initdata = {
	{ .compatible = "samsung,exynos8890-oscclk", .data = (void *)0, },
};
[...]
```
This is an example from the [clock driver](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/drivers/clk/samsung/clk-exynos8890.c#L122) of the downstream kernel.
When the kernel processes a node of the *Device Tree*, it looks for a driver that exposes the same
`compatible` string. In this case, it would be the *Samsung* clock driver.
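As a rough mental model of that lookup, the following Python sketch pairs a node's `compatible` strings with driver match tables. It is a toy, not kernel code: the real matching happens in C inside the kernel, and everything except the `samsung,exynos8890-oscclk` string, including the driver labels, the second table and the function name, is made up for illustration.

```python
# Toy model: each driver advertises the `compatible` strings it can handle,
# much like the of_device_id table in the C snippet above.
DRIVER_MATCH_TABLES = {
    "exynos8890-clock": ["samsung,exynos8890-oscclk"],   # descriptive label only
    "hypothetical-uart": ["vendor,hypothetical-uart"],   # invented for illustration
}


def find_driver(node_compatible, tables=DRIVER_MATCH_TABLES):
    """Return the first driver whose table lists one of the node's strings.

    `node_compatible` is ordered from most specific to most generic, which is
    how `compatible` lists are written in a Device Tree.
    """
    for wanted in node_compatible:
        for driver, supported in tables.items():
            if wanted in supported:
                return driver
    return None  # no driver claims this node
```

A node with `compatible = "samsung,exynos8890-oscclk"` resolves to the clock entry, while a node whose strings appear in no table stays without a driver. That is the situation a freshly written node is in until a matching driver exists in the kernel.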

So at this point I was wildly copying driver code over into the mainline kernel. As I forgot this during the
porting attempt, I am mentioning my mistake again: I never thought about the possibility that the kernel
would not boot at all.

After having "ported" the driver code for the clock and some other devices, I decided to try and boot the
kernel. With my phone plugged into the serial adapter, my terminal showed nothing. So I went into the
*S-Boot* console to poke around. There I tried some commands in the hope that the bootloader would initialize
the hardware for me so that the kernel would magically boot and give me serial output. One was especially
interesting at that time: The name made it look like it would test whether the processor can do **SMP**,
**S**ymmetric **M**ulti**p**rocessing, meaning several identical CPU cores working side by side (not to be
confused with *SMT*, which is what *Intel* markets as *Hyper-Threading*).
After issuing it and continuing to boot, I got some output via the serial interface! It was garbage data, but
it was data. This gave me some hope. However, it was just some data that was pushed by something other than
the kernel. I checked this hypothesis by installing the downstream kernel, issuing the same commands and
booting the kernel.

## Back To The Drawing Board
At this point I was kind of frustrated. I knew that this endeavour was going to be difficult, but I had immensely
underestimated it.

After taking a break, I went back to my computer with a new tactic: Port as few things as possible, confirm that
it boots and then port the rest. This was inspired by the way the *Galaxy Nexus* was mainlined, as described in
[this](https://postmarketos.org/blog/2019/06/23/two-years/) blog post.

What did I do this time? The first step was a minimal *Device Tree*. No clock nodes. No serial nodes. No
GPIO nodes. Just the CPU, the memory and a *chosen* node. After setting the `CONFIG_PANIC_TIMEOUT`
[option](https://cateee.net/lkddb/web-lkddb/PANIC_TIMEOUT.html) to 5, waiting at least 15 seconds and seeing
no reboot, I thought that the phone had booted the mainline kernel. But before getting too excited, as I
kept in mind that it was a hugely difficult endeavour, I asked in *postmarketOS*' mainline Matrix channel
whether it could happen that the phone panics and still does not reboot. The answer I got
was that it could, indeed, happen. It seems like the CPU does not know how to shut itself off. On the x86 platform, this
is the task of *ACPI*, while on *ARM*, [*PSCI*](https://linux-sunxi.org/PSCI), the **P**ower **S**tate
**C**oordination **I**nterface, is responsible for it. Since the mainline kernel knows about *PSCI*, I wondered
why my phone did not reboot. After some thinking, I came up with three possibilities:

1. The kernel boots just fine and does not panic. Hence no reboot.
2. The kernel panics and wants to reboot, but the *PSCI* implementation in the downstream kernel differs from the mainline code.
3. The kernel just does not boot.

I threw the first possibility out of the window immediately. It was just too easy. As such, I began
investigating the *PSCI* code. Out of curiosity, I looked at the implementation of the `emergency_restart`
function of the kernel and discovered that the function `arm_pm_restart` is used on *arm64*. Looking deeper, I
found out that this function is only set when the *Device Tree* contains a *PSCI* node of a supported version.
The downstream node is compatible with version `0.1`, which does not support the `SYSTEM_RESET` functionality
of *PSCI*. Since I could just turn off or restart the phone when using *Android* or *postmarketOS*, I knew
that there must be something that works around the old firmware.

The downstream [*PSCI* node](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/arch/arm64/boot/dts/exynos8890.dtsi#L316) just specifies that it is compatible with `arm,psci`, so
how do I know that it is only firmware version `0.1`, and how do I know about this `SYSTEM_RESET`?

If we grep for the compatible attribute `arm,psci`, we find it as the value of the `compatible` field in the
source file `arch/arm64/kernel/psci.c`. It [specifies](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/arch/arm64/kernel/psci.c#L381) that an exact match on `arm,psci`
results in a call to the function `psci_0_1_init`. This indicates a version of *PSCI*. If we take a look
at *ARM*'s [*PSCI* documentation](http://infocenter.arm.com/help/topic/com.arm.doc.den0022d/Power_State_Coordination_Interface_PDD_v1_1_DEN0022D.pdf)
we find a section called *"Changes in PSCIv0.2 from first proposal"* which states that
the `SYSTEM_RESET` call was added in version 0.2. Hence we can guess that the *Exynos8890* SoC
comes with firmware which only supports version 0.1 of *PSCI*.
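This chain of reasoning can be written down as a small lookup. The Python sketch below is only an illustration: the `arm,psci` entry is the one discussed above, the `arm,psci-0.2` and `arm,psci-1.0` strings are, to my knowledge, the other values the kernel's *PSCI* code matches on, and the helper name is invented.

```python
# compatible string -> PSCI version implied by the kernel's match table
PSCI_VERSIONS = {
    "arm,psci": (0, 1),      # handled by psci_0_1_init
    "arm,psci-0.2": (0, 2),  # assumption: taken from the kernel sources as I remember them
    "arm,psci-1.0": (1, 0),  # assumption: only present in newer kernels
}

# SYSTEM_RESET first appears in version 0.2 of the specification.
SYSTEM_RESET_SINCE = (0, 2)


def supports_system_reset(compatible):
    """Return True if a node with this `compatible` implies SYSTEM_RESET exists."""
    version = PSCI_VERSIONS.get(compatible)
    return version is not None and version >= SYSTEM_RESET_SINCE
```

For the downstream node, `supports_system_reset("arm,psci")` is `False`, which matches the observation that a *PSCI*-based reboot cannot work with this firmware.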

After a lot of searching, I found a node called `reboot` in the [downstream source](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/arch/arm64/boot/dts/exynos8890.dtsi#L116).
The compatible driver for it is within the [*Samsung* SoC](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/drivers/soc/samsung/exynos-reboot.c) driver code.

Effectively, this code reboots the SoC by mapping the address of the PMU, which I guess stands for
*Power Management Unit*, into memory and writing some value
to it. This value is probably the command which tells the PMU to reset the SoC.
In my "patchset" *patches_v2* I have ported this code. Testing it with the downstream kernel, it
made the device do something. Although it crashed the kernel, it was enough to debug.

To test the mainline kernel, I added an `emergency_restart` call at the beginning of the `start_kernel` function.
The result was that the device did not do anything. The only option I had left was 3: the kernel does not even
boot.

At this point I began investigating the `arch/arm64/` code of the downstream kernel more closely. However, I
noticed something unrelated during a kernel build: The downstream kernel logs something with *FIPS* at the
end of the build. Grepping for it led me to some code at [the end](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/scripts/link-vmlinux.sh#L253) of the `link-vmlinux.sh` script. I thought
that it was signing the kernel with a key in the repo, but it is probably doing something else. I tested
whether the downstream kernel boots without these crypto scripts and it did.

The only thing I did not test was whether the kernel boots without
["double-checking [the] jopp magic"](https://github.com/ivanmeler/android_kernel_samsung_herolte/blob/lineage-15.1/scripts/link-vmlinux.sh#L270). But by looking at this script, I noticed another interesting thing:
`CONFIG_RELOCATABLE_KERNEL`. Having just a rough idea of what this config option enables, I removed it
from the downstream kernel and tried to boot. But the kernel did not boot. This meant that this option
was required for booting the kernel. This was the only success I can report.

By grepping for this config option I found the file `arch/arm64/kernel/head.S`. I did not know what it was
for, so I searched the internet and found a [thread](https://unix.stackexchange.com/questions/139297/what-are-the-two-head-s-files-in-linux-source)
on *Unix & Linux Stack Exchange* that explained that the file
is prepended onto the kernel and executed before `start_kernel`. I mainly investigated this file, but in
hindsight I should have also looked more at the other occurrences of the `CONFIG_RELOCATABLE_KERNEL` option.

So what I did was try to port over code from the downstream `head.S` into the mainline `head.S`. This is
the point where I am at now. I did not progress any further as I am not used to assembly code, let alone *ARM*
assembly, but I still came up with some more hypotheses as to why the kernel does not boot.

1. For some reason the CPU never reaches the instruction to jump to `start_kernel`.
2. The CPU fails to initialize the MMU or some other low-level component and thus cannot jump into `start_kernel`.

At the moment, option 2 seems the most likely, as the code of the downstream kernel and the mainline kernel
differ somewhat and I expect that *Samsung* added some code because their MMU might have some quirks that the
mainline kernel does not address. However, I have not had the chance to either confirm or refute any of these
assumptions.

As a bottom line, I can say that the most useful, but in my case most ignored, thing I learned is patience.
During the entire porting process I tried to do as much as I could in the shortest amount of time possible.
However, I quickly realized that I got the best ideas when I was doing something completely different. As
such, I also learned that it is incredibly useful to always have a piece of paper or a text editor handy
to write down any ideas you might have. You never know what might be useful and what not.

I also want to mention that I used the [*Bootlin Elixir Cross Referencer*](https://elixir.bootlin.com/linux/latest/source)
a lot. It is a very useful tool when exploring the kernel source tree. However, I would still
recommend having a local copy so that you can very easily grep through the code and find things that
neither *GitHub* nor *Elixir* can find.