Once again, I’m reminded of just how crucial good error logs are. Last night our production MySQL instance went down, taking several mission-critical systems with it, most notably JIRA. Of course, we didn’t know that at the time, because JIRA’s error logs never indicated a problem connecting to the DB, and neither did those of at least one other affected application. Luckily, Stash had clearly logged that the DB connection had timed out, and we were able to take this information to the DB guys to get it resolved.
I know that things sometimes fail in odd and unexpected ways, and that this might not be a failure that could have been anticipated, but the takeaway is simple: make sure your logging is stout, because it’s the difference between getting the system back up in a jiffy and scratching your head.
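To make the point concrete, here is a minimal sketch of the kind of logging that would have saved us time. It is not how JIRA or Stash actually do it; `connect_with_logging`, the `app.db` logger name and the `label` argument are all names I made up for illustration. The idea is just to wrap whatever opens the DB connection so that a failure is logged with a clear message and a traceback before it propagates.

```python
import logging

# Hypothetical logger name; use whatever your application already has.
logger = logging.getLogger("app.db")


def connect_with_logging(connect, label):
    """Call connect() and log a specific error if it fails.

    `connect` is any zero-argument callable that opens a DB connection
    (an assumption for this sketch); `label` names the target in the log.
    """
    try:
        return connect()
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Database connection to %s failed", label)
        raise  # re-raise so the caller still sees the failure
```

The failure still reaches the caller, but now the log says which database could not be reached and why, which is exactly what we were missing.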
This is the error MySQL reported, which we found once we had determined where the issue was:
141026 12:19:17 InnoDB: ERROR: the age of the last checkpoint is 9446264,
InnoDB: which exceeds the log group capacity 9433498.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.
It looks like we’ll need to do some performance tuning to make sure we can handle large data blobs and large transactions.
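As a rough starting point for that tuning, the arithmetic in the error message can be sketched as follows. The two numbers in the first call come straight from the log. The 16 MiB largest-row size is purely a hypothetical figure, since we have not measured ours yet, and the function names are my own. I am also assuming two log files in the group, so the per-file result is what would go into `innodb_log_file_size` with `innodb_log_files_in_group` set to 2.

```python
import math


def checkpoint_overshoot(checkpoint_age, log_group_capacity):
    """Bytes by which the checkpoint age exceeds the log group capacity."""
    return checkpoint_age - log_group_capacity


def min_log_file_size(largest_row_bytes, files_in_group=2, factor=10):
    """Smallest per-file size so that the combined size of all log files
    is at least `factor` times the largest row (the rule in the error)."""
    combined = largest_row_bytes * factor
    return math.ceil(combined / files_in_group)


# Values taken from the error message.
print(checkpoint_overshoot(9446264, 9433498))  # 12766

# Hypothetical: assume the largest BLOB/TEXT row is 16 MiB.
print(min_log_file_size(16 * 1024 * 1024))  # 83886080
```

So we overshot by only about 12 KB, and under the assumed 16 MiB row the log files would need to be at least 80 MiB each. The real value depends on the largest row we actually store.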