In Feb 2005 a component of the BlackBerry network infrastructure experienced a service interruption which impacted all users. In 2007 a BlackBerry service outage impacted 5 million customers and RIM were perceived to have poor crisis management. Al Sacco has collated a timeline of this and other major BlackBerry outages for the period 2007-2009. In short, there have been several warnings about the health of the RIM network infrastructure which have not been dealt with effectively by the Senior Leadership Team.
“Wireless communication has gone from a travel convenience to a mission critical communications tool” – Randall De Lorenzo
RIM operates its own data network, including four Network Operations Centres (NOCs) for its wireless email system: two in Waterloo, Ontario (one for North America, one for South America), one in the UK (Slough) for Europe, the Middle East and Africa, and one for the Asia Pacific region. All BlackBerry data traffic passes through one of these centres.
The official statement reported that RIM experienced a core switch [hardware] failure. However, there is chatter, reputedly from a RIM insider, that a software upgrade was applied in the Slough NOC on Monday. The real point of failure appears to be a fundamental piece of RIM’s own software called the “Relay”, which directs traffic within each of the four NOCs.
“I actually think the Relay has reached melting point and, err, melted,” the former RIM staffer said.
To compound matters the RIM Egham (Surrey) data centre’s Oracle database, a bespoke and heavy-duty communications data storage application, was corrupted. This database is effectively the “brain” of the BlackBerry Internet Service, handling messages and forwarding data to users. With saving the Oracle database the top priority, RIM was forced to repair software while it was still running – a difficult and fraught process known as a “hotfix”.
“Working with a live database like that is the stuff of nightmares,” explained one network engineer.
This database corruption problem, according to industry sources, is thought to be the reason the outage lasted well into Thursday for many users.
Mike Lazaridis (co-CEO) posted this message, “To all our Customers”. The message backlog has been relieved, with service finally back to normal. Video: http://youtu.be/zQ1esvGae_s
RIM’s proprietary network architecture carries BBM messages and data with high efficiency, encrypts them and offloads carrier bandwidth. By operating its own network, RIM was able to provide a highly secure, encrypted and reliable service when the company had a relatively small number of subscribers, but some telecom experts suggest it is difficult to scale up such a network to handle the traffic generated by 70m subscribers. Insiders say RIM’s management is aware of these problems (there is, for instance, no need to encrypt video messages) and has been considering ways to address them.
Apologising for interruptions and delays, RIM’s chief information officer, Robin Bienfait, previously said on the company’s website: “You’ve depended on us for reliable, real-time communications, and right now we’re letting you down. We are taking this very seriously and have people around the world working around the clock to address this situation.”
On Wednesday it emerged that an intermittent service outage preventing users from accessing email had spread to 30m-40m people, half of all BlackBerry subscribers worldwide. David Yach, RIM’s chief technology officer for software, said that RIM had to restrict service everywhere due to a backlog of undelivered messages. A “BlackBerry Jam” bottleneck?
For some users, everything has come to a standstill. For others e-mail works but BlackBerry Messenger is down. For yet others, e-mail comes in bursts every few hours and outgoing e-mails don’t go from the device.
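One way to see why a backlog forces a provider to restrict service is a simple queue calculation: a backlog only shrinks at the rate by which delivery capacity exceeds new incoming traffic. The sketch below is illustrative only. The function name and every figure in it are hypothetical and are not RIM data.

```python
def drain_hours(backlog: float, arrival_rate: float, service_rate: float) -> float:
    """Hours needed to clear a message backlog while new messages keep arriving.

    All inputs are hypothetical illustration values (messages, messages/hour).
    """
    net_rate = service_rate - arrival_rate  # how fast the queue actually shrinks
    if net_rate <= 0:
        # Capacity is fully consumed by new traffic: the backlog never clears.
        return float("inf")
    return backlog / net_rate


if __name__ == "__main__":
    backlog = 100.0   # hypothetical: 100 million queued messages
    capacity = 10.0   # hypothetical: 10 million delivered per hour

    # Normal inbound traffic close to capacity: the queue drains slowly.
    print(drain_hours(backlog, arrival_rate=8.0, service_rate=capacity))

    # Inbound traffic restricted: the same queue drains much sooner.
    print(drain_hours(backlog, arrival_rate=4.0, service_rate=capacity))
```

Under these made-up numbers, restricting inbound traffic shortens the recovery considerably, which is consistent with a provider throttling service everywhere until the queue has cleared.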
Telecoms analyst Dean Bubley said the problem could lie in the way RIM “siphons off” all internet traffic to and from BlackBerry handsets for optimisation purposes, without allowing fallback to normal internet peering points in the event of RIM’s systems going down.
“RIM routes all data traffic via its servers and network infrastructure, [creating a] single point of failure,” Bubley told ZDNet UK. “It does lots of good things — it compresses data a lot, adds security, manages email connections, [enables] BBM and so on — but it also routes the ‘vanilla’ web traffic through that path as well.”
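Bubley’s point can be sketched as a tiny routing model. This is a minimal illustration, not RIM’s actual design: the `Network` class, the `route` function, the service names and the `allow_fallback` switch are all hypothetical. It simply shows that, without a fallback path, plain web traffic fails together with BBM and email when the RIM-side path is down.

```python
from dataclasses import dataclass


@dataclass
class Network:
    relay_up: bool        # is the RIM-side path (NOC/relay) healthy?
    allow_fallback: bool  # hypothetical: may plain web traffic bypass RIM?


# Services assumed here to need RIM's infrastructure (compression, security, BBM).
RIM_ONLY = {"bbm", "email"}


def route(kind: str, net: Network) -> str:
    """Return the path a request takes, or 'FAILED' if no path is available."""
    if net.relay_up:
        return "via RIM NOC"
    if kind not in RIM_ONLY and net.allow_fallback:
        return "direct via carrier internet peering"
    return "FAILED"


if __name__ == "__main__":
    # Simulate an outage of the RIM-side path, with and without fallback.
    for fallback in (False, True):
        net = Network(relay_up=False, allow_fallback=fallback)
        for kind in ("bbm", "email", "web"):
            print(f"fallback={fallback!s:5} {kind:5} -> {route(kind, net)}")
```

With fallback disabled, all three kinds of traffic fail during the outage. With it enabled, only the services that genuinely depend on RIM’s servers are lost and ordinary web browsing carries on.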
“It is sad to say, but it is almost impossible to deny: the CrackBerry has become the CrapBerry” – Jon C. Ogg
Users are venting their “outage outrage” on Twitter and in the blogosphere. Many are reaching the same conclusion: this is a communications crisis for Research in Motion.
From RIM you’d expect round-the-clock crisis management, constant, helpful communication and swift resolution. However, RIM appears to be pursuing a textbook version of how not to respond in a crisis situation.
The Ten Golden Rules of Crisis Communication
PAS 200:2011 is the new Publicly Available Specification standard for Crisis Management and was officially launched on 29th September 2011. The standard is designed to help organizations take practical steps to improve their ability to deal with crises.
Developed by leading crisis management experts, PAS 200 provides users with a framework that delivers the practical steps to identify potential crises, mitigate the risks and avoid potentially damaging [reputational] consequences.
Section 6 of PAS 200 provides guidance on how to communicate in a crisis:
6.2 Communications strategy
6.3 Formal and informal communications structures
6.4 Planning to communicate
6.5 Methods of communication
6.6 Barriers to effective communication
A lack of immediate response leads to a vacuum, which is almost always filled with negative perception and commentary. It would seem that RIM management have learned no lessons from previous outages.
The RIM blog – BlackBerry Help – would be an ideal medium through which to keep users informed. As of 11am Wednesday there was no mention on the Help blog of the near-total failure of its services across a significant percentage of the globe.
RIM Management definitely need to revisit their Crisis Management Plan (Red Book). The Crisis Management Plan is typically hosted in a private cloud to make it available from anywhere in the world to an authorised member of the Crisis Management Team.
All this comes at a precarious time for RIM’s management. The company is losing market share to Apple’s iOS and Google’s Android system (its share is down to 12% from 19% a year earlier). In the quarter ending in August its revenues were 10% lower than a year before at $4.2 billion and profits were down by more than half. This chart plots the RIMM share price for the last 12 months (source: FT.com Market Data). The RIMM share price closed the week at $23.97, down from a peak of $70.54 in Feb 2011.
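To put those figures in proportion, the declines can be worked out as percentages. The snippet below is a small illustrative calculation using only the numbers quoted above; the helper name `pct_decline` is my own.

```python
def pct_decline(before: float, after: float) -> float:
    """Percentage fall from `before` to `after`, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


if __name__ == "__main__":
    # Share price: Feb 2011 peak of $70.54 versus the $23.97 weekly close.
    print(pct_decline(70.54, 23.97))  # 66.0

    # Market share: 19% a year earlier versus 12% now, as a relative fall.
    print(pct_decline(19, 12))        # 36.8
```

In other words, the share price has lost roughly two thirds of its peak value, and the seven-point drop in market share is a relative decline of more than a third.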
So what is the revised RIM Service Strategy? Should RIM ditch its proprietary network and make use of best-of-breed carriers instead? This shift in direction would enable Senior Management to concentrate on Mobile (7 OS) and PlayBook tablet software (QNX) as well as shifting new hardware devices.
Competing Service Offering
RIM’s claims of providing an enterprise-class service are quite hollow given the known single points of failure in their network design and the number of service outages in recent years. How about seriously considering the encrypted “iMessage” service from the fruit-based company as a replacement for BlackBerry Messenger (BBM)? What did the iPhone say to the BlackBerry? iWork.
RBS trial of iPhone poses new threat to BlackBerry. UK bank is the latest to test Apple smartphone’s suitability for confidential corporate communications following the large-scale BlackBerry shutdown last week.
By managing this crisis differently, Jim and Mike could have mitigated the risk to the RIM brand and reputation. For a company that promises a secure, trusted service, they have failed to keep that promise to their customers. It may be perceived that the RIM leadership are more interested in Technology, given that Mike is an engineer by background, than grounded in Service Principles. Let’s see what happens to the [reputedly loyal] customer base as a result of not providing Service Utility and Warranty.
Service Improvement Action
Taking guidance from the new PAS 200:2011 and having a defined Crisis Management Team and Plan with clear role accountabilities is a must-have, because there will definitely be another network service outage: there has been at least one per year since 2007, and the expected increase in customers in emerging markets will put further strain on the four NOCs.
How well will RIM respond next time it happens and will it be a tipping point for their loyal CrackBerry customers?