Is CopperEgg a reliable and stable monitoring solution?
This is a question I’ve asked myself over the course of the last six months. Outage after outage, I wonder how a company like this is able to stay in business and continue to operate profitably.
A Twitter follower tweeted me today about an article he wrote. I wanted to run it as a makeshift guest blog post to give the issue at hand more exposure. You can find the original article posted on LinkedIn (here).
— Rob ‘Bubba’ Hines (@saysBubba) September 29, 2014
CopperEgg Availability Questions
By Robert Hines on Sep 29 2014
CopperEgg has put out a general call for customers to ask questions. I have a few; here they are:
Ernest, thanks for taking on the role, as well as taking what appears to be a bold step in addressing our faith in CopperEgg. Our trust and confidence have steadily eroded and could certainly use some shoring up. Seeing some effort to communicate with your customers is great, and indicates that perhaps there is justification in continued investment in CopperEgg Services.
However, at the moment, CopperEgg hasn’t presented itself as trustworthy; rather, it seems to be working as hard as possible to demonstrate that it is not. We have a slew of questions, which we will post right here, and then watch closely to see if you respond, and if you do, how you respond.
Our first question is why you don’t effectively monitor your own services. In spite of what your FAQ says about your own monitor, in both major outages (and many minor outages over the past year), we have been aware of issues with CopperEgg up to 30 minutes prior to any notification from you. In many cases it seems that your operations team is merely waiting for customers to complain.
On a more serious note, it is even easier to jump to the conclusion that you are attempting to hide the fact that you are having issues. Perhaps someone thinks if they can just get it fixed before a customer notices, they can pretend the system is stable? If you were consistently five to ten minutes behind customers noticing issues, that would make sense, since you need to confirm that there is an issue. However, when you consistently take so much longer than that, it seems wise to believe you are either ignorant, inaccurate, or incompetent.
How are we to rely on a monitoring and alerting service that doesn’t appear capable of monitoring and alerting on itself?
Springboarding off this, CopperEgg service was completely disrupted for approximately 36 hours on September 17th. Despite the length and severity of that outage, CopperEgg provided a single, uninformative update while it was under way.
Why did CopperEgg choose not to tell its customers what was going on?
Still on the issue of lack of communication, CopperEgg has been down hard on two separate occasions within approximately a single week. We have seen nothing indicating that you intend to provide a detailed post-mortem explaining the root cause of the outage, what you are doing to prevent it from happening again, and why we should believe you have this under control.
When asked in a public forum (similar to this) what CopperEgg would be doing to reassure customers and/or provide restitution, CopperEgg responded with “email customer support”. There is so much wrong with that response.
Why choose this path?
More to the point, why not choose transparency and demonstrable character? Why not publicly and immediately promise your customers you will be working to win back their trust, starting with a full post-mortem, and ending with clear refunds and/or service credits? Why not proactively reach out to your customers as opposed to telling them that they would need to email customer service to privately discuss this?
Not only has there been an astounding silence from CopperEgg, coupled with overt attempts to maintain the silence, but what we have heard from CopperEgg has been full of misinformation. Case in point: on Friday you tweeted that there was no data loss. However, we clearly lost data, a big chunk of it!
I may be way off base here, but I’d wager a large portion of my paycheck that everyone who uses any custom metrics lost 100% of that data during the outage as well. How could you post that statement? It merely feels like you are pouring salt in the wounds of our already strained relationship when you post things as true that we know are not true.
Finally, your handling of and response to the Amazon reboots is pitiful. Your customers are on AWS as well; we got the same notification and had to deal with the same issues. We have architected and engineered our cloud solution to handle issues like this without impacting our customers. Why haven’t you?
Seriously, at a minimum, I should not be woken up at 3:04 a.m. at any time, much less on a Saturday morning, by a massive string of false-positive alerts telling me that my entire flipping infrastructure crashed … when you knew at least three days in advance that this was coming!
How are we not to believe that your current FAQ answer is a deflection, an attempt to head off questions about why CopperEgg’s infrastructure isn’t better designed and implemented? (http://copperegg.com/important-notification-copperegg-service-interruption/ is a good move in the correct direction.)
We aren’t ignorant of the fact that bad things happen to people. We deal with our own share of outages, and (cough cough) our own share of our vendors’ outages causing our customers issues. We truly desire to be good customers, the kind of customers we would want.
However, utter failure to effectively and meaningfully communicate in the midst of outages, coupled with what feels like indifferent customer care, makes it challenging to believe that you get it, that you want customers, and particularly that you want us as customers. We don’t think you get it. Do you get it?
If so, will you:
- Communicate when and where we can expect the full post-mortem, root-cause to resolution, to be publicly posted?
- Tell us publicly how you intend to “make this right” (give us a full refund for the month of September)?
- Call us and apologize?
As for me personally, I just want back some of the life that has been irrevocably stolen from me on account of CopperEgg failures, outages, false positives, and false negatives. However, it’s my own just reward: I’m the one who ultimately architected and engineered our cloud solution, relying on and recommending CopperEgg.