mjallen18 70002a19e2 hmm
2026-04-07 18:39:42 -05:00

# Troubleshooting Guide
Common issues and solutions for this NixOS configuration.
## Build Failures
### `nixos-rebuild switch` fails
1. **Syntax error** — the error message includes the file and line number. Common causes: missing `;`, unmatched `{`, wrong type passed to an option.
2. **Evaluation error** — read the full error trace. Often caused by a module option receiving the wrong type, or a missing `cfg.enable` guard.
3. **Fetch failure** — a flake input or package source can't be downloaded. Check network connectivity, or try:
```bash
nix flake update <input-name>                 # Nix 2.19 and later
nix flake lock --update-input <input-name>    # older Nix releases
```
4. **Disk space** — build sandbox fills up. Free space:
```bash
# -d also deletes old generations, so they can no longer be used for rollback
sudo nix-collect-garbage -d
df -h /nix
```
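A minimal sketch of a pre-flight check for the disk-space case above. It assumes GNU `df` (for `--output=pcent`) and a 90% threshold chosen purely for illustration; the function names are hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: warn before a rebuild when the filesystem holding /nix is nearly full.

# usage_exceeds PERCENT LIMIT -> exit 0 when PERCENT is strictly greater than LIMIT.
usage_exceeds() {
  local pct="${1%\%}"   # tolerate a trailing "%" such as "93%"
  local limit="$2"
  [ "$pct" -gt "$limit" ]
}

# store_usage [PATH] -> print the used percentage of the filesystem holding PATH.
# Assumes GNU coreutils df, which supports --output=pcent.
store_usage() {
  df --output=pcent "${1:-/nix}" | tail -n 1 | tr -d ' %'
}

# check_store_space [LIMIT] [PATH] -> warn and return 1 when usage is above LIMIT (default 90).
check_store_space() {
  local limit="${1:-90}" path="${2:-/nix}" pct
  pct="$(store_usage "$path")"
  [ -n "$pct" ] || return 2   # df produced no output (path missing, non-GNU df, ...)
  if usage_exceeds "$pct" "$limit"; then
    echo "warning: $path is ${pct}% full; free space before rebuilding" >&2
    return 1
  fi
}
```

Running `check_store_space && sudo nixos-rebuild switch --flake ".#$(hostname)"` would skip the rebuild while the store is above the threshold.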
### Assertion failures
If you see `assertion failed`, read the `message` field. For example:
```
error: assertion failed at …/nebula/sops.nix
mjallen.services.nebula.secretsPrefix must be set
```
Set the option named in the message (here `mjallen.services.nebula.secretsPrefix`) in the system configuration.
## Boot Issues
### System won't boot after a config change
1. At the boot menu, select a previous generation.
2. Once booted, revert the change:
```bash
cd /etc/nixos
git revert HEAD
sudo nixos-rebuild switch --flake .#$(hostname)
```
### Booting from installation media to recover
```bash
# Mount the system (adjust device paths as needed)
sudo mount /dev/disk/by-label/nixos /mnt
sudo mount /dev/disk/by-label/boot /mnt/boot
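# Chroot in (the remaining commands run inside the chroot shell)
sudo nixos-enter --root /mnt
cd /etc/nixos
# Revert and rebuild (replace <hostname> with the flake output for this machine)
git revert HEAD
nixos-rebuild switch --flake .#<hostname> --install-bootloader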
```
### Lanzaboote / Secure Boot issues
If Secure Boot enrolment fails or the firmware rejects the boot files as unsigned:
```bash
# Check enrolled keys
sbctl status
# Re-enrol if needed (run as root); note the subcommand is spelled "enroll-keys"
sbctl enroll-keys --microsoft
# Sign bootloader files manually
sbctl sign -s /boot/EFI/systemd/systemd-bootx64.efi
```
## SOPS / Secrets Issues
### `secret not found` or permission denied at boot
1. Verify the secret key path matches what's declared in the module's `sops.nix`.
2. Check the secret exists in the SOPS file:
```bash
sops --decrypt secrets/nas-secrets.yaml | grep "the-key"
```
3. Check that the `owner` and `group` set on the secret match the user the service runs as.
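The ownership check in step 3 can be sketched as a small helper. This assumes GNU `stat` (for `-c` format strings); the function name is hypothetical. `-L` is used so that a symlinked secret path reports the owner of the target file.

```bash
#!/usr/bin/env bash
# secret_owner_matches FILE USER GROUP -> exit 0 when FILE is owned by USER:GROUP.
# Hypothetical helper; assumes GNU stat.
secret_owner_matches() {
  local file="$1" want="$2:$3" have
  # %U and %G print the owning user and group names.
  have="$(stat -L -c '%U:%G' "$file")" || return 2
  if [ "$have" != "$want" ]; then
    echo "$file is owned by $have, expected $want" >&2
    return 1
  fi
}
```

Call it as `sudo bash -c '...'` or from a root shell if the secret's directory is not readable by your user, passing the secret's path and the service's user and group.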
### Can't decrypt — wrong age key
The machine's age key is derived from `/etc/ssh/ssh_host_ed25519_key`. If the host key was regenerated, the age key changed and existing secrets can no longer be decrypted.
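To fix this, re-encrypt the secrets file for the new public key. Run this with a key that can still decrypt the file (for example, an admin age key listed in `.sops.yaml`):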
```bash
# Get the new public key
nix-shell -p ssh-to-age --run 'ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub'
# Update .sops.yaml with the new key, then:
sops updatekeys secrets/nas-secrets.yaml
```
### Adding a new secret to an existing file
```bash
sops secrets/nas-secrets.yaml
# Editor opens with decrypted YAML — add your key, save, sops re-encrypts
```
## Nebula VPN Issues
### Peers can't connect
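1. Verify the lighthouse is reachable on its public address. Nebula runs over UDP, so a successful probe only means no ICMP rejection came back, not that the lighthouse answered: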
```bash
nc -zvu mjallen.dev 4242
```
2. Check the nebula service on both hosts:
```bash
systemctl status nebula@jallen-nebula
journalctl -u nebula@jallen-nebula -n 50
```
3. Confirm the CA cert, host cert, and host key are all present and owned by the `nebula-jallen-nebula` user:
```bash
ls -la /run/secrets/pi5/nebula/
```
4. Verify the host cert was signed by the same CA as the other nodes:
```bash
nebula-cert verify -ca ca.crt -crt host.crt
```
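Step 3 of the checklist above (files present and owned by the service user) can be sketched as one function. The default of three files reflects the CA cert, host cert, and host key; the function name is hypothetical, and `-mindepth`/`-maxdepth` assume GNU or BSD `find`.

```bash
#!/usr/bin/env bash
# nebula_secrets_ok DIR USER [EXPECTED_COUNT]
# Exit 0 when DIR holds at least EXPECTED_COUNT files (default 3) and every
# one of them is owned by USER. Hypothetical helper, not part of this repo.
nebula_secrets_ok() {
  local dir="$1" user="$2" expected="${3:-3}" count foreign
  # find aborts on an unknown user name, so check that first.
  if ! id -u "$user" >/dev/null 2>&1; then
    echo "unknown user: $user" >&2
    return 2
  fi
  # -L follows symlinks so the ownership of the target files is checked.
  count="$(find -L "$dir" -mindepth 1 -maxdepth 1 -type f | wc -l | tr -d ' ')"
  if [ "$count" -lt "$expected" ]; then
    echo "expected at least $expected files in $dir, found $count" >&2
    return 1
  fi
  foreign="$(find -L "$dir" -mindepth 1 -maxdepth 1 -type f ! -user "$user")"
  if [ -n "$foreign" ]; then
    echo "not owned by $user:" >&2
    echo "$foreign" >&2
    return 1
  fi
}
```

For the host in the example above this would be run as root: `nebula_secrets_ok /run/secrets/pi5/nebula nebula-jallen-nebula`.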
### Certificate expired
Re-sign the host certificate:
```bash
nebula-cert sign -name "hostname" -ip "10.1.1.x/24" \
-ca-crt ca.crt -ca-key ca.key \
-out-crt host.crt -out-key host.key
# Store the new cert and key in the SOPS file, then rebuild
```
## Impermanence Issues
### Service fails because its data directory is missing after reboot
If a service stores state in a path that isn't in the persistence list, that state is wiped on every reboot. Add the path to `mjallen.impermanence.extraDirectories`:
```nix
mjallen.impermanence.extraDirectories = [
  { directory = "/var/lib/my-service"; user = "my-service"; group = "my-service"; mode = "0750"; }
];
```
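Then copy the existing data into the persistent location if needed:
```bash
# Copy the directory's contents so a pre-existing target does not end up with a nested my-service/
sudo mkdir -p /persist/var/lib/my-service
sudo cp -a /var/lib/my-service/. /persist/var/lib/my-service/
```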
## Flake Input Issues
### Input update breaks a build
Restore the lock file from before the last commit. This rolls back every input that changed in that commit, not just one:
```bash
git checkout HEAD^ -- flake.lock
```
Or pin a single input to a specific revision in `flake.nix`, replacing `abc123def` with the full commit hash:
```nix
nixpkgs-unstable.url = "github:NixOS/nixpkgs/abc123def";
```
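A small sketch for building the pinned URL safely. It assumes that the `github:` flake reference only treats its last component as a commit when it is a full 40-character hexadecimal hash, and that anything shorter is read as a branch or tag name; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# pin_github_input OWNER REPO REV -> print a flake URL pinned to REV.
# Rejects anything that is not a full 40-character hexadecimal commit hash.
pin_github_input() {
  local owner="$1" repo="$2" rev="$3"
  case "$rev" in
    *[!0-9a-fA-F]*) echo "rev must be hexadecimal: $rev" >&2; return 1 ;;
  esac
  if [ "${#rev}" -ne 40 ]; then
    echo "rev must be a full 40-character commit hash, got ${#rev} characters" >&2
    return 1
  fi
  printf 'github:%s/%s/%s\n' "$owner" "$repo" "$rev"
}
```

The printed string is what goes on the right-hand side of `nixpkgs-unstable.url = "...";`. A full hash can be obtained with `git rev-parse <short-hash>` in a clone of the input's repository.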
## Service Issues
### Service won't start
```bash
systemctl status <service>
journalctl -u <service> -n 100 --no-pager
```
### Caddy reverse proxy not routing
1. Check that `reverseProxy.enable = true` is set on the service.
2. Verify the subdomain matches: `reverseProxy.subdomain = "myapp"` → `myapp.mjallen.dev`.
3. Check Caddy logs:
```bash
journalctl -u caddy -n 50
```
### PostgreSQL database missing for a service
If `configureDb = true` is set, the database is created automatically. If it is missing, create it manually:
```bash
sudo -u postgres createdb my-service
# Names containing a hyphen must be double-quoted inside SQL
sudo -u postgres psql -c 'GRANT ALL PRIVILEGES ON DATABASE "my-service" TO "my-service";'
```
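Database and role names such as `my-service` contain a hyphen, so PostgreSQL only accepts them in SQL when they are written as double-quoted identifiers. A minimal sketch of helpers that build the statement with correct quoting (both function names are hypothetical):

```bash
#!/usr/bin/env bash
# pg_quote_ident NAME -> print NAME as a double-quoted PostgreSQL identifier,
# doubling any embedded double quotes.
pg_quote_ident() {
  printf '"%s"\n' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# grant_all_sql DB ROLE -> print a GRANT statement with both names quoted.
grant_all_sql() {
  printf 'GRANT ALL PRIVILEGES ON DATABASE %s TO %s;\n' \
    "$(pg_quote_ident "$1")" "$(pg_quote_ident "$2")"
}
```

Usage: `sudo -u postgres psql -c "$(grant_all_sql my-service my-service)"`.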
## Network Issues
### Firewall blocking a service
Check which ports are open:
```bash
sudo nft list ruleset | grep accept
```
Add ports in the system config:
```nix
mjallen.network.firewall.allowedTCPPorts = [ 8080 ];
```
If the service is defined with `mkModule`, check that `openFirewall` has not been set to `false`; it defaults to `true`, so the port should already be open.
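Before changing firewall rules, it can help to confirm that the service is actually listening on the port. The sketch below filters the output of `ss -Htln` (listening TCP sockets, numeric, no header) and assumes the local address is the fourth column; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# port_is_listening PORT -> exit 0 when stdin (output of `ss -Htln`) contains a
# listening socket whose local address ends in :PORT.
port_is_listening() {
  awk -v port="$1" '
    # Field 4 is the local address, e.g. 0.0.0.0:8080 or [::]:8080.
    { n = split($4, parts, ":"); if (n > 0 && parts[n] == port) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

Usage: `ss -Htln | port_is_listening 8080 && echo "something is listening on 8080"`. If nothing is listening, the problem is the service rather than the firewall.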
## Getting Help
- NixOS manual: `nixos-help` or https://nixos.org/manual/nixos/stable/
- NixOS Wiki: https://nixos.wiki/
- NixOS Discourse: https://discourse.nixos.org/
- Nix package search: https://search.nixos.org/packages