By Mark Kozak-Holland / Translated by Yang Lei (楊磊)
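To recap Titanic’s situation: after the collision (see Part 8), the ship lurched off the ice shelf and got under way again, heading for Halifax. Everything seemed fine, but after 20 minutes at 8 knots it was obvious how inaccurate the initial assessment had been. Continuing to sail took its toll, and the ship took on more water. Other sections that the collision had not affected also began to leak under the water pressure. The rising water was turning into a catastrophe.
Today, the first priority is to bring service back online quickly with a temporary fix while a permanent fix is worked out. The essential point, however, is to monitor the service environment closely and check whether the fix is holding.
The second search party, which included the architect Thomas Andrews and the carpenter John Hutchinson, reported major flooding in five compartments and recognized that this went far beyond what Titanic had been designed for. The grinding along the bottom had badly torn the outer skin and damaged the double hull. The different flooding rates in the six main compartments also indicated that the top hull was damaged. That things could become this bad was beyond the designer’s expectations.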
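In today’s IT projects, it is vital that the project team plans in advance for the kind of eventuality in which no fix helps and the situation goes beyond the MTTR procedures (see Part 9). For end users and customers, the service is down and hard to restore. For this situation, disaster recovery procedures should be set up, prepared, planned, and tested within the project itself (see Part 4), and institutionalized with dedicated staff (operations groups/technical support).
The architect realized that Titanic’s condition had gone beyond ordinary problem recovery and had become a disaster. He said the ship had two and a half to three hours before it sank, and he correctly judged that nothing could be done. Too many compartments were breached, and they were flooding faster than the pumps could cope with. The bulkheads between compartments had not been carried up to watertight horizontal traverses, so as the ship’s nose went down, water spilled from one compartment into the next, like water filling an ice cube tray. The ballroom in effect became a large channel distributing water across the ship.
At this point we can see how the compromises on non-functional requirements during the construction phase of the project (see Part 3) led to massive consequences in the disaster.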
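Only the captain and some of the officers knew the extent of the damage, and by now they could only watch the ship go down. No "abandon ship" order or other formal declaration of disaster was issued. Only about 65 minutes after the collision did the captain order the officers to uncover the lifeboats and get passengers and crew up on deck. Titanic had no formal disaster recovery plan.
Today, the next step would be to invoke the disaster recovery plan and communicate it to everyone. Every disaster recovery plan should come with a well-thought-out communication plan that communicates clearly with different audiences.
Titanic’s captain understood the seriousness of the problem soon after the collision, but he did not communicate it through the crew to the passengers. This added to the confusion on board, especially among the crew. For example, the engine room sent engineers up to the deck, only for the bridge to send them back. Possible explanations for such poor communication on board include: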
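●The ship’s communication systems were limited, and there was no public-address system. Important information could only be passed to passengers by word of mouth, with crew members knocking on cabin doors. With hundreds of cabins, this took far too long.
●The crew themselves were unclear about the real situation, so what passengers were told varied. The veteran captain had great confidence in the ship’s safety systems and may have found the architect’s verdict hard to believe, because everything still seemed normal at first. The captain behaved almost as if everything were normal.
●The captain knew that the lifeboats were too few, with room for only about half of the 2,223 people on board. It was perhaps better not to cause panic and to let the lifeboats take passengers off calmly and in order when the time was right. The ship’s hierarchical structure and the separation of classes meant that first-class passengers had priority access to lifeboat places.
●The captain feared that panic would spread. He and his officers knew the story of the French liner La Bourgogne, which had sunk 14 years earlier. Then, too, only half the passengers had lifeboat places, and panic broke out. Captain Smith knew he could save as many people as possible by loading those lucky enough to reach the boats. So he did not tell all the passengers, especially those in third class.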
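Today, a communication plan is probably as important as a disaster recovery plan, for the following reasons:
●Internal communication with employees greatly helps control the impact of a disaster. Speed also matters: for example, get the information to customer-facing employees first so that they can pass it on to customers.
●External communication with customers is also important. The communication plan needs to reach each customer segment through different channels, depending on the scope of the problem or disaster.
●Depending on how severe the service outage is, communication with the press may be necessary. This requires deciding what the key messages are, how they are released, and through which channels. Many companies have been caught off guard when roving reporters approached unaware employees with trick questions.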
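Conclusions
Today, many IT projects seriously undermine operations because they have no response prepared for the worst case. MTTR procedures alone are not enough. Besides a disaster recovery plan, a well-thought-out communication plan must also be in place. The next part will look at invoking disaster recovery.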
Original text:
To recap Titanic’s situation: following the collision (Part 8), the ship was restarted and limped off the ice shelf with the objective of sailing back to Halifax. Everything appeared to be in good shape, but after 20 minutes of sailing at 8 knots it was apparent that the initial determination was grossly inaccurate. The forward motion had taken its toll and the ship had taken on more water. Parts of the ship that were initially unaffected had started to spring leaks under the strain of the water, and the increase in flooding was becoming catastrophic.
In today’s world, getting service back online is a top priority, usually by applying a temporary fix whilst a permanent fix is created. However, in such a situation it is essential that the service delivery environment is closely monitored to confirm whether the fix is holding.
The second search party, with the architect Thomas Andrews and the carpenter John Hutchinson, reported major flooding in five compartments and recognized that Titanic was not designed for this. The grinding along the bottom had badly ruptured the outer skin and damaged the double hull. The different rates of flooding in the six primary compartments indicated the top hull or tank top was damaged. It was beyond the expectations of the designer that something in nature could inflict so much damage.
In today’s IT projects, it is vital that the project team plans for the eventuality in which the fix does not resolve the problem and the outage goes beyond the Mean Time To Recovery (MTTR) for the IT solution (see Part 9). The service is unavailable to end users and customers and is no longer readily recoverable. For this situation, disaster recovery procedures need to be set up, prepared, planned, and tested in the project itself (Part 4) and "institutionalized" with the staff (operations groups/technical support).
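The escalation decision described above can be sketched in code. The following Python is a minimal illustration only: the names `IncidentStatus` and `next_action`, the three action labels, and the idea of comparing elapsed outage time against a single MTTR target are assumptions made for this example, not part of the article.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IncidentStatus:
    """Hypothetical snapshot of an incident, as seen by the operations group."""
    elapsed: timedelta   # time since the service was lost
    fix_applied: bool    # a temporary fix has been put in place
    fix_holding: bool    # monitoring shows the fix is still working


def next_action(status: IncidentStatus, mttr_target: timedelta) -> str:
    """Decide the next step (illustrative rule, not a standard procedure)."""
    # A temporary fix that holds keeps the incident in normal recovery:
    # keep watching the service environment.
    if status.fix_applied and status.fix_holding:
        return "monitor"
    # No working fix and the outage has outlasted the MTTR target:
    # normal problem recovery has failed, so escalate.
    if status.elapsed > mttr_target:
        return "invoke_disaster_recovery"
    # Still inside the MTTR window: keep working on a fix.
    return "continue_recovery"
```

For instance, with an assumed MTTR target of one hour, a fix that stops holding 90 minutes into the outage would lead `next_action` to return `"invoke_disaster_recovery"`, which is the point at which the tested disaster recovery procedures would take over.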
The architect realized the situation onboard Titanic had gone beyond normal problem recovery and had become a disaster. He stated that the ship had 2.5 to 3 hours before completely sinking, and accurately determined that the problem could not be fixed. Too many compartments were ruptured and were rapidly flooding beyond the capacity of all the pumps. The bulkhead walls separating the compartments had not been carried up to watertight horizontal traverses. Therefore, as the ship’s nose went down, water spilled from one compartment to another, rather like an ice cube tray filling with water. The ballroom acted as a massive channel for distributing water horizontally across the ship.
At this point in the story we see how the compromises to the non-functional requirements during the construction phase (see Part 3) of the project had a massive consequence in the disaster.
Only the captain and a few officers knew the extent of the damage and were now resigned to the ship sinking. No "abandon ship" command or formal declaration of a disaster was given. Around 65 minutes after the collision the captain just gave orders to the officers to uncover the lifeboats and get the passengers and crew ready on deck. No formalized disaster recovery plan was in place on board Titanic.
In today’s world, the next step would be to invoke a disaster recovery plan and communicate it to all on board. Every disaster recovery plan needs to be accompanied by a well-thought-out communication plan that communicates clearly with different audiences.
Titanic’s captain knew the seriousness of the situation relatively quickly after the collision, but did not communicate this through the ranks of crew and passengers on board. This increased the confusion, particularly among the crew. For example, the engine room sent some engineers to the boat deck, but the bridge sent them back down to the engine room. There are a number of possible explanations for the poor communication aboard Titanic:
·The ship had very limited communication, with no public-address systems. Important information was communicated to passengers by word of mouth, the crew knocking on each cabin door and common room. Considering there were hundreds of cabins, this could take hours.
·The crew didn’t have accurate information on the situation, so varying degrees of information were passed to passengers. The experienced captain believed in the safety systems of the ship and might have found the architect’s verdict very hard to accept because everything appeared so normal in the first hour. The captain acted almost as if the situation was "business as usual."
·The captain realized that the carrying capacity of the lifeboats was inadequate, with only enough room for about half of the estimated 2,223 people on board. It was perhaps better to keep things calm and allow the lifeboats to be filled in an orderly manner when the timing was right. The ship’s hierarchical structure and segregation of classes meant that first-class passengers had the best access to the boats.
·The captain feared widespread panic. He and the other officers were aware of the French liner La Bourgogne, which had sunk 14 years earlier. With room in the lifeboats for only half the people on board, widespread panic had broken out. Captain Smith knew he could save the maximum number of lives by loading only those who were lucky enough to reach the boats. So he may have avoided informing all the passengers, specifically those in third class.
In today’s world a communication plan is probably as important as a disaster recovery plan, for several reasons:
·Communicating internally with your employees can greatly help control the impact of a disaster. Also, the speed of communication is essential. For example, get information to customer-facing employees first, so they can inform customers.
·Communicating externally with your customers is essential and the plan needs to cater to customer segments using different channels, depending on the scope of the problem or disaster. A customer-retention strategy might need to be offered.
·Communicating with the press may be necessary depending on how serious the loss of service is. This requires the identification of key messages, how these are communicated, and through what channels. Many companies have been caught off guard when roving reporters trap unaware employees with questions.
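One way to make such a communication plan concrete is to record it as data, so that the order of notification and the choice of channel are decided before a disaster rather than during it. The Python below is a hedged sketch: the audiences mirror the three points above, but the channel names, the numeric severity levels, and the names `Audience`, `PLAN`, and `notification_sequence` are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Audience:
    """One entry in a hypothetical communication plan."""
    name: str
    channel: str        # assumed channel name, for illustration only
    order: int          # lower numbers are notified earlier
    min_severity: int   # notify only at or above this severity (1 = minor, 3 = disaster)


# Illustrative plan: customer-facing staff hear first so they can inform
# customers; the press is contacted only for the most serious outages.
PLAN = [
    Audience("customer-facing employees", "internal alert", 1, 1),
    Audience("all employees", "internal email", 2, 1),
    Audience("customers", "status page", 3, 2),
    Audience("press", "prepared statement", 4, 3),
]


def notification_sequence(severity: int, plan: List[Audience] = PLAN) -> List[str]:
    """Return who to notify, in order, for an outage of the given severity."""
    due = [a for a in plan if severity >= a.min_severity]
    return [f"{a.name} via {a.channel}" for a in sorted(due, key=lambda a: a.order)]
```

Under these assumed thresholds, a minor outage (severity 1) stays internal, while a severity 3 outage works through all four audiences and ends with the prepared statement for the press.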
Conclusions
Today, many IT projects severely compromise an operation by not preparing for worst-case scenarios. MTTR procedures alone are not enough. Aside from a disaster recovery plan, a well-thought-out communication plan needs to be in place. The next installment will look at invoking disaster recovery.