July 13, 2018 3:36:38 PM CEST By Daniil Baturin
We have managed to resolve the infrastructure problems and bring the affected machines back intact. Now downloads.vyos.io and ci.vyos.net are back online, and our build hosts with all the build dependencies setup and uncommited code are back online too, so luckily we can resume the work from the point where it was stopped by the RAID issue.
We are not ready to give a complete post mortem on the actual issue yet (may not even be able to at all) since we have limited data and the service provider support was not exactly helpful. What we can say now is that what was thought to be RAID5 actually was a RAID1 with an additional hot spare drive, and the failure mode included a RAID controller glitch apart from drive failure. The hot spare drive is what allowed us to bring the data back intact.
While this issue was resolved without any data loss, it definitely prompts us to reconsider many things about our infrastructure, including backup strategy, deployment mechanisms, and service provider choice.
We are going to write new posts as we decide upon the options and roll out improvements. Infrastructure deployment is also a good area for contributions, and some people already offered help.