blog.polynom.me/content/blog/2020-01-03-Selfhosting-Lessons.md

<!-- title: Lessons Learned From Self-Hosting -->
<!-- render: yes -->
Roughly eight months ago, according to my hosting provider, I spun up my VM which
I use to this day to self-host my chat, my mail, my git and so on. At the beginning, I thought that
it would allow me both to get away from proprietary software and to learn Linux administration. While
my first goal was met without any problems, the second one I achieved in ways I did not anticipate.

During these eight months, I learned quite a lot. Not by reading documentation, but by messing up
deployments. So this post is my telling of how I messed up and what lessons I learned from it.

# Lesson 1: Document everything
I always tell people that you should document your code. When asked why, I answer that you won't
remember what that line does when you have not looked at your codebase for weeks or months.

What I did not realise is that this also applies to administration. I only wrote basic documentation
like a howto for certificate generation or a small troubleshooting guide. This, however, missed the most
important thing to document: the entire infrastructure.

Whenever I needed to look up my port mapping, what did I do? I opened up my *Docker compose* configuration
and searched for the port mappings. What did I do when I wanted to know what services I have? I opened my
*nginx* configuration and searched for `server` directives.

This is a very slow process since I have to remember what services I have behind a reverse proxy and which
ones I have simply exposed. This led me in the end to create a folder - called `docs` - in which
I document everything. What certificates are used by what and where they are, port mappings, a graph
showing the dependencies of my services, ... While it may be tedious to create at first, it will really
help.

```
[World]
+
|
+-[443]-[nginx]-+-(blog.polynom.me)
                +-(git.polynom.me)-[gitea]
```

Above, you can see an excerpt from my *"network graph"*.
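
Part of such a `docs` folder can be generated rather than written by hand. The following is a minimal Python sketch, not the script used for this server: it pulls the host names out of `server_name` directives in an *nginx* configuration. The example configuration, including the `proxy_pass` target, is made up for illustration.

```python
import re

# Matches "server_name a.example b.example;" at the start of a line
# (commented-out directives are skipped) and captures the names.
SERVER_NAME_RE = re.compile(r"^\s*server_name\s+([^;]+);", re.MULTILINE)


def list_server_names(nginx_config: str) -> list[str]:
    """Return every host name declared in a server_name directive."""
    names: list[str] = []
    for match in SERVER_NAME_RE.finditer(nginx_config):
        names.extend(match.group(1).split())
    return names


# Hypothetical configuration excerpt, only for illustration.
EXAMPLE_CONFIG = """
server {
    listen 443 ssl;
    server_name blog.polynom.me;
}
server {
    listen 443 ssl;
    server_name git.polynom.me;
    location / { proxy_pass http://127.0.0.1:3000; }
}
"""

if __name__ == "__main__":
    for name in list_server_names(EXAMPLE_CONFIG):
        print(name)
```

A regular expression is not a real *nginx* parser, so this only works for simple configurations, but it is enough to seed a list of services that can then be annotated by hand.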

# Lesson 2: Version Control everything
Version Control Systems are a great thing. Want to try something out? Branch, try out and then either
merge back or roll back. Want to find out what changes broke something? Diff the last revisions and narrow
down your "search space". Want to know what you did? View the log.

While it might seem unnecessary, it helps me keep my cool, knowing that if I ever mess up my configuration, I
can just roll back the configuration from within git.

# Lesson 3: Have a test environment
While I was out once, I connected to a public Wifi. There, however, I could not connect to my VPN. It simply
did not work. A bit later, my Jabber client *Conversations* told me that it could not find my server. After
some thinking, I came to the conclusion that the provider of said public Wifi was probably blocking port `5222`
*(XMPP Client-to-Server)* and whatever port the VPN is using. As such, I wanted to change the port my
Jabber server uses. Since I do not have a failover server I tried testing things out locally, but gave up
after some time and just went and "tested in production". Needless to say that this was a bad idea. At first,
*Conversations* did not do a DNS lookup to see the changed XMPP port, which led me to remove the DNS entry.
However, after some time - probably after the DNS change propagated far enough - *Conversations* said that it
could not find the server, even though it was listening on port `5222`. Testing with the new port yielded
success.

This experience was terrible for me. Not only was it possible that I broke my Jabber server, but it would
annoy everyone I got to install a Jabber client to talk to me as it would display *"Cannot connect to..."*.
If I had tested this locally, I probably would have been much calmer. In the end, I nervously watched as everyone
gradually reconnected...
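
Even without a full test environment, a small check can tell whether a port is reachable from the network you are currently on. The following is a minimal sketch using only Python's standard library. The function name is my own choice, and it only tests whether a TCP connection can be opened, not whether an XMPP server is actually answering.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False
```

Calling `port_is_reachable("xmpp.example.org", 5222)` (a placeholder host name) from the restricted network and again from home would show whether the network is blocking the port or the server is at fault.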

# Lesson 4: Use tools and write scripts
The first server I ever got I provisioned manually. I mean, back then it made sense: It was a one-time provisioning and nothing should
change after the initial deployment. But now that I have a continually evolving server, I somehow need to document every step in case
I ever need to provision the same server again.

In my case, that tool is *Ansible*. In my playbook I keep all the roles, e.g. *nginx*, *matterbridge*, *prosody*, separate and apply them to my one
server. In there I also made **heavy** use of templates. The reason for it is that before I started my [*"Road to FOSS"*](https://blog.polynom.me/Road-to-Foss.html)
I used a different domain that I had lying around. Changing the domain name manually would have been a very tedious process, so I decided to use
templates from the get-go. To make my life easier in case I ever change domains again, I defined all my domain names based on my `domain` variable.
The domain for git is defined as {% raw %}`git.{{ domain }}`{% endraw %}, the blog one as {% raw %}`blog.{{ domain }}`{% endraw %}.
Additionally, I make use of *Ansible Vaults*, allowing me to have encrypted secrets in my playbook.
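
The idea of deriving every host name from one variable can be illustrated outside of *Ansible* as well. The sketch below is not Jinja2 and not how *Ansible* works internally. It is a stand-in using Python's `string.Template`, and `example.org` as well as the names `VARIABLES`, `HOST_TEMPLATES` and `render_hosts` are assumptions made for the example.

```python
from string import Template

# Hypothetical stand-in for a variables file: one base domain.
VARIABLES = {"domain": "example.org"}

# Service host names are derived from the base domain, never hard-coded.
HOST_TEMPLATES = {
    "git": Template("git.${domain}"),
    "blog": Template("blog.${domain}"),
}


def render_hosts(variables: dict[str, str]) -> dict[str, str]:
    """Fill in every host template with the given variables."""
    return {
        service: template.substitute(variables)
        for service, template in HOST_TEMPLATES.items()
    }


if __name__ == "__main__":
    for service, host in render_hosts(VARIABLES).items():
        print(f"{service}: {host}")
```

Changing the single `domain` value changes every derived host name, which is what makes a domain move a one-line edit.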

During another project, I also set up an *Ansible* playbook. There, however, I did not use templates. I templated the configuration files using a Makefile
that was calling `sed` to replace the patterns. Not only was that a fragile method, it was also unneeded as *Ansible* was already providing
this functionality for me. I was just wasting my own time.

What I also learned was that one *Ansible* playbook is not enough. While it is nice to automatically provision a server using *Ansible*, there are other things
that need to be done. Certificates don't rotate themselves. From that, I derived a rule stating that if a task needs to be done more than once, then it is
time to write a script for it.

# Lesson 4.1: Automate
Closely tied to the last point: If a task needs to be performed regularly, then you should consider creating a cronjob, or a systemd timer if that is more your thing,
to automatically run it. You don't want to enjoy your day, only for it to be ruined by an expired certificate causing issues.
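
As an example of what such a scheduled job could check, here is a minimal Python sketch that computes how many days a certificate has left, given its `notAfter` timestamp in the textual format certificates use. The 14-day threshold and the function names are assumptions for illustration. How the timestamp is obtained (for example from `openssl x509 -noout -enddate`) is left out.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left until a certificate's notAfter timestamp.

    not_after uses the format found in certificates,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def needs_rotation(
    not_after: str, threshold_days: float = 14, now: Optional[float] = None
) -> bool:
    """True if the certificate expires within threshold_days (assumed default)."""
    return days_until_expiry(not_after, now) < threshold_days
```

The `now` parameter exists only so that the check can be tested against a fixed point in time. A cronjob would leave it out.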

Since automated cronjobs can cause trouble as well, I decided to run all automated tasks on days and at times during which I am likely to be able to react. As such, it is very
important to notify yourself of those automated actions. My certificate rotation, for example, sends me an email at the end, telling me if the certificates
were successfully rotated and if not, which ones failed. For those cases, I also keep a log of the rotation process somewhere else so that I can review it.
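
A notification like that could be assembled roughly as follows. This is a sketch, not my actual rotation script: the function name, the subject lines and the shape of `results` (certificate name mapped to success or failure) are assumptions, and actually sending the message, for example with `smtplib`, is left out.

```python
from email.message import EmailMessage


def build_rotation_report(
    results: dict[str, bool], sender: str, recipient: str
) -> EmailMessage:
    """Build an email summarising which certificates were rotated."""
    # Collect the names of all certificates whose rotation failed.
    failed = sorted(name for name, ok in results.items() if not ok)

    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    if failed:
        message["Subject"] = f"Certificate rotation: {len(failed)} failed"
        body = "Failed to rotate:\n" + "\n".join(f"- {name}" for name in failed)
    else:
        message["Subject"] = "Certificate rotation: all succeeded"
        body = f"Rotated {len(results)} certificate(s) successfully."
    message.set_content(body)
    return message
```

Putting the outcome into the subject line means a failure is visible without even opening the mail.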

# Lesson 5: Unexpected things happen
After having my shiny server run for some time, I was happy. It was basically running itself. Until *Conversations*, while connected to a public Wifi,
was unable to contact my server. This is something that I did not anticipate, but it happened nevertheless.

This means that my deployment was not a run-and-forget solution but a constantly evolving system, where small improvements are periodically added.

# Conclusion
I thought I would just write down my thoughts on all the things that went wrong over the course of my self-hosting adventure. They may not
be best practices, but they are things that really helped me a lot.

Was the entire process difficult? At first. Was the experience an opportunity to learn? Absolutely! Was it fun? Definitely.