KeyChest Status
Brief updates of KeyChest status and planned operational changes

KeyChest Service Status Updates (times are in UTC)

Note: real-time server monitoring and alerts (above) added on May 13.

IPv6 addresses not being audited

Time: till 17 July, 2018, since ... quite some time

Upgrades of the KeyChest audit engines introduced a regression that prevented IPv6 addresses from being resolved. The problem was fixed within 20 minutes of being confirmed. However, it had probably been present for quite some time.

All server audit results should be corrected within 12 hours.
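For illustration only, here is a minimal sketch of resolving both IPv4 and IPv6 addresses for a host with Python's standard library. It is not the KeyChest audit engine code; the function name `resolve_all` and the default port are assumptions made for this example.

```python
import socket


def resolve_all(hostname, port=443):
    """Return the IPv4 and IPv6 addresses of a host as two sorted lists."""
    ipv4, ipv6 = set(), set()
    # AF_UNSPEC asks the resolver for both A (IPv4) and AAAA (IPv6) records;
    # restricting the family to AF_INET would drop every IPv6 address.
    for family, _, _, _, sockaddr in socket.getaddrinfo(
        hostname, port, family=socket.AF_UNSPEC, type=socket.SOCK_STREAM
    ):
        if family == socket.AF_INET:
            ipv4.add(sockaddr[0])
        elif family == socket.AF_INET6:
            ipv6.add(sockaddr[0])
    return sorted(ipv4), sorted(ipv6)
```

Returning the two address families separately makes it easy to notice when one of them is unexpectedly empty.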

Quick Audit not working in some browsers

Time: 6pm, 8 July 2019 - 9:55am, 9 July 2019 (BST)

After today's upgrade, we can see problems with opening KeyChest caused by older JS scripts. We are reviewing the issues and improving the code.

Duplicate entries in weekly reports

Time: June - 8 July 2019

We have finally removed duplicate entries from weekly email reports. KeyChest has changed significantly over the last 6 or so months and the emailing module is now catching up with those changes.

Please accept our apologies for this inconvenience, which in several cases led to incorrect report details.

Error in discovering subdomains

Time: 23 April - 26 June

We have discovered and fixed a regression error that impacted the discovery of subdomains. We have taken the opportunity to strengthen the robustness of this functionality. We will keep monitoring this aspect to make sure it has been fully fixed.

Missing items should be automatically registered within 24 hours.

Occasional connection resets

Time: 20 June, 7:30am - 8:30am (BST)

As part of our maintenance, we needed to clear web connections.

Some old IP addresses not correctly removed from audits

Time: 12 June, 9:00am (BST)

We mentioned earlier in weekly email reports that some entries appear twice. While we have fixed the cause, the database still contains some old entries. We have now deployed the first step of a two-step clean-up process and we hope the changes will be completed in the next 24 hours.

System failure due to DB cluster failure

Time: 20 May 4:00am - 20 May 6:30am (BST)

Increased throughput of the audit engine makes the database cluster unstable. We are rethinking the operational side. There are several options we are looking at, but first we need to get an opinion from Digital Ocean support, which is still pending. Their initial view is that downtimes / brownouts of internal networks should not be happening, but we have received no authoritative answer yet (as of 12th June).

System failure due to exhaustion of available memory

Time: 13 May 22:00 - 13 May 23:05 (BST)

We installed a new audit engine at 19:45 (BST), which contained a bug that prevented it from reacting to increasing RAM usage. This caused a failure at 22:01, which we started investigating at 22:06. The bug discovery, the bug fix and the system recovery were completed at 23:05.
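As an illustration of what reacting to increasing RAM usage can look like, the sketch below picks a worker count from the current memory usage. It is a hypothetical example and not the actual fix; the function name `next_worker_count`, the thresholds and the back-off policy are all assumptions.

```python
def next_worker_count(current_workers, used_bytes, total_bytes,
                      high_water=0.85, low_water=0.60, max_workers=16):
    """Pick the worker count for the next cycle from current RAM usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    usage = used_bytes / total_bytes
    if usage >= high_water:
        # Memory pressure: halve the pool, but always keep one worker.
        return max(1, current_workers // 2)
    if usage <= low_water:
        # Plenty of headroom: grow by one worker, up to the cap.
        return min(max_workers, current_workers + 1)
    # Between the two thresholds: leave the pool unchanged.
    return current_workers
```

Using two thresholds rather than one avoids the pool size flapping up and down when usage hovers around a single limit.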

KeyChest stability update

Time: 11 May, 18:00 - 19:00 (BST)

Upgrading audit engine to a new multiprocessing version. This may cause short downtimes.

KeyChest stability update

Time: 10 May, 14:55 - 16:00 (BST)

Installing Python3 and upgrading some other packages, which may cause short downtimes.

KeyChest stability update

Time: 7 May, 10:04am

We have spent a considerable amount of time on improving the stability of KeyChest. It appears that the underlying problem lies in network issues between its servers. We have now deployed some configuration changes and custom real-time metrics for Netdata monitoring, and we will keep a close eye on this issue. For those interested, I've written a detailed description of our efforts on my blog https://magicofsecurity.com/mysql8-cluster-and-networking-problems/, which was picked up by the CEO of Percona :) (https://www.linkedin.com/feed/update/...)

The whole system was down for about 2 hours with intermittent follow-up issues

Time: 2 May 5:30am - 9:00am

Internal networking issues caused a disconnection of the database. We have updated the database system to the newest version (from 8.0.15 to 8.0.16) as it provides new configuration options for unreliable networks. We requested support from DigitalOcean but we have to accept that network issues can happen and we need to be resilient. We also filed a new feature request for Netdata (https://github.com/netdata/netdata) to collect cluster information for MySQL8.

Intermittent login problems

Time: 29 Apr 18:00 - 29 Apr 21:00

The system failed to delete old log files, which left insufficient free disk space for internal interprocess communication (Redis).

Update: 30 Apr, 12pm: an hour-long downtime followed between 9:00am and 10:00am this morning as a result of invalid cached data.
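A minimal sketch of the kind of housekeeping involved: delete log files older than a given age and check how much disk space is left. This is an illustration under assumptions, not our production script; the function names, the `.log` suffix and the 14-day default are hypothetical.

```python
import os
import shutil
import time


def prune_old_logs(log_dir, max_age_days=14, now=None):
    """Delete *.log files in log_dir older than max_age_days; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    # Take a snapshot of the directory first so files are not removed
    # while the directory is still being iterated.
    with os.scandir(log_dir) as it:
        entries = list(it)
    removed = []
    for entry in entries:
        if entry.is_file() and entry.name.endswith(".log"):
            if entry.stat().st_mtime < cutoff:
                os.remove(entry.path)
                removed.append(entry.name)
    return sorted(removed)


def free_fraction(path):
    """Return the fraction of the filesystem holding path that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

A value from `free_fraction` that drops below some chosen limit can be used to raise an alert before services that need scratch space start failing.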

Billing page is likely to show errors due to expired certificate at jsonvat.com

Time: 29 Apr 6:50am - 29 Apr 11:00am

We use jsonvat.com to collect European VAT rates. We will try to remediate this problem before 11am.
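For illustration, the remaining lifetime of a certificate can be computed from its notAfter field with Python's standard library. This is a sketch only: the function name `days_until_expiry` is an assumption, and the input is expected in the format that `SSLSocket.getpeercert()` uses for `notAfter`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date (negative once expired)."""
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format
    # used in the dict returned by SSLSocket.getpeercert().
    expires = ssl.cert_time_to_seconds(not_after)
    now = datetime.now(timezone.utc).timestamp() if now is None else now
    return (expires - now) / 86400
```

A negative result means the certificate has already expired, which is the situation described above.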

Planned KeyChest upgrade

Time: 10 Apr 21:00 - 10 Apr 23:00

We plan to do a significant upgrade of KeyChest, which includes database changes to support new features like search for domains similar to yours, real-time monitoring of the system, and an upgrade of the audit engine from Python 2.7 to Python 3.6.

Expected impact: KeyChest will be in maintenance mode and its web interface will not be available.

Planned Digital Ocean datacenter upgrade

Time: 26 Mar 21:00 - 27 Mar 3:00am

Expected impact: There may be a few minutes of increased latency as well as small amounts of packet loss while we shift traffic to redundant devices. We will endeavor to keep this to a minimum for the duration of the change.

KeyChest upgrade

Time: 17 Mar 23:40 - 18 Mar 3:35am

The service was taken down for upgrades and maintenance.

KeyChest outage

Time: 15 Mar 7:40am - 15 Mar 13:05

A couple of things happened at the same time - but they were triggered by network issues between our virtual servers. Those caused a rapid growth of database synchronization data, which filled one of the servers' disks. The recovery was a matter of restoring from backups but it took a little longer, unfortunately.

Database node outage

Time: 12 Mar 9:40am - 13 Mar 10:30am

This was the second time within a fortnight and it became clear that we should not try to optimize for speed if it means losing resiliency. We have now disabled direct access to local database nodes for reading. There is a small impact on the speed of the web UI but KeyChest is more resilient as it fully utilizes its failure-resistant database cluster.

Improving performance of the audit engine

Time: 9 Mar 11:39am

Ever since the KeyChest migration to Digital Ocean, we can see delays in audits while the CPU is only used at 30% of its capacity. We are working on removing bottlenecks so that we can show you first results for new domains more quickly.

One of the database nodes down due to network errors

Time: 7 Mar 4:40pm

Update @18:42, 7 Mar: the database cluster is back to normal.

One of our database nodes dropped out of the database cluster due to network errors. We have restarted it and the system should be back to normal shortly. There may be a small impact on real-time notifications for servers added / discovered between 8:50am and 16:30 UTC.

API changes - integration

Time: 7 Mar 10:14am

We are improving the integration of the API with online reports and the audit engine. The update will also allow users to manually add services with wildcard certificates under the correct main domain. It will improve the visibility of wildcard certificates and allow us to improve how online reports present them.

Audit Engine incorrectly ignores valid domains

Time: 18 Feb 9:40am - 19 Feb 23:20

Update @23:20, 19 Feb: the audit engine has now updated all faulty records.

Update @10:32am: we have identified a probable cause of the problem. The audit engine will be updating results over the next 12 hours. We will probably re-send updated weekly summaries once the re-audit has been completed.

The effect of the error is out-of-date information in weekly reports. We are looking into the issue and expect to resolve it today. We will provide updates in due course.

Upgrade to version 1.7.7

Time: 18 Feb 2019, 08:40 - 08:45

Upgrade resolving some bugs and adding small improvements.

Upgrade to version 1.7.6

Time: 17 Feb 2019, 23:55 - 00:01

Upgrade resolving some bugs and adding small improvements.

Audit Engine running to its schedule

Time: 14 Feb 2019, 16:00

Periodic audits were behind schedule before the migration, and the migration caused an additional backlog. A faster server has now helped to clear the whole backlog, and periodic audits of your services are now executed as configured: every 12 hours.

ROCA tester fails

Time: 12 Feb - 14 Feb 11:50 am

We have discovered a problem with the ROCA tester after the migration to Digital Ocean. The problem has been fixed and the tester should now be fully functional.

Upgrade to version 1.7

Time: 13 Feb 2019, 13:00 - 13:04

We have migrated KeyChest to a Digital Ocean datacenter.

Migration to Digital Ocean

Time: 9 Feb 2019, 23:50 - 10 Feb 2019, 5:00am

We have migrated KeyChest to a Digital Ocean datacenter.

Creating this status page

Time: 14 February, 2019

We decided to create a simple page for the service status updates. We will keep adding entries about discovered problems / bugs as well as about planned operational changes.