Earlier today AU experienced an extended service outage affecting the forums, blog and wiki. Unfortunately, I was not immediately reachable, so I was not aware of it until sometime after 2pm. The web service was down and the database was not responding; restarting those services brought the system back online.
Early this morning (0305h) a file integrity check ran according to its daily schedule. Due to an inconsistency on the /var volume, the integrity check ran into problems and consumed too much space on the volume. Other tasks scheduled in the same time frame began to fail, and email alerts were generated and sent to me, which I saw this morning. Upon checking the system I noticed the issues with /var and decided to unmount /var, run a disk check and remount it. While successful, the process created over 10000 files in lost+found that I needed to review. In the midst of that tedious process, I got tasked with dressing a freshly bathed toddler and prepping him for a day out with his parents. I completely forgot to come back and finish what I had started. So the cause of this whole debacle was simply…
…human error. Mine.
Sorry folks, it’s all back up and running now… but I still have 9900 files to vet.
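For anyone facing a similar pile of recovered files, here is a rough sketch of how a first pass might be automated. It is not what I actually ran; the function names `classify` and `summarize` are made up for illustration, and the text/binary guess (looking for a NUL byte in the first 512 bytes) is only a crude heuristic. It just counts entries by rough kind so you know what you are dealing with before opening anything by hand.

```python
import os
from collections import Counter


def classify(path):
    """Guess a rough kind for one file from its first bytes (crude heuristic)."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(512)
    except OSError:
        return "unreadable"
    if not head:
        return "empty"
    # A NUL byte near the start usually means the file is not plain text.
    if b"\x00" in head:
        return "binary"
    return "text"


def summarize(directory):
    """Count the entries directly inside `directory` by rough kind."""
    counts = Counter()
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            counts[classify(entry.path)] += 1
        elif entry.is_dir(follow_symlinks=False):
            counts["directory"] += 1
        else:
            # Symlinks, sockets, device nodes and so on.
            counts["other"] += 1
    return counts
```

Calling something like `summarize("/var/lost+found")` (the path is an assumption, and reading it normally requires root) returns a `Counter` of kinds, which at least tells you how many of the files are empty and can be skipped right away.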