By Mark Kozak-Holland; translated by 楊磊
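To recap briefly the situation at the time: Titanic's officers tried desperately to avoid a collision (see Part 8). However, the S-turn, although the correct decision, failed to slow the ship enough. Hundreds of passengers later said that the ship came to a halt almost innocuously, with a quiver and a few seconds of rumbling and grinding, as if the hull were rolling over a mass of stone marbles.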
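There was no "crash stop", no fatalities, not even minor injuries. There was no violent sideways jolt and no repeated strikes along the ship's length, both of which would be expected had the ship been struggling to turn away from an iceberg hitting its side. The breakfast cutlery laid out in the dining saloons barely trembled, and not a drink was spilled in the first-class smoking rooms and lounges. All the evidence indicates that the ship's bottom had come to rest on an ice shelf at the underwater base of the iceberg. First Officer Murdoch had avoided a head-on crash that could have demolished the first four compartments and killed or maimed hundreds of passengers.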
Likewise, when an IT solution becomes unstable in production, a series of actions is taken according to a process that was prepared, planned and tested within the project itself (see Part 4). This process should be built around MTTR (Mean Time To Recovery). Its main purpose is to bring the IT solution back online as quickly as possible so that it meets its Service Level Agreements (SLAs). The solution is then patched up in the background with a temporary or permanent fix.
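Admittedly, before the solution goes back online, its integrity must first be established so that the failure does not recur. Working against the clock, the operations team steps through the process and the four problem quadrants: detection, determination, resolution and recovery. Once the MTTR clock starts ticking, marking the beginning of an outage (a loss of service, see Part 2), metrics should be captured as User Outage Minutes (UOMs), which measure how many users were without service and for how long.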
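The four quadrants can be timed against the MTTR clock. The following is a minimal sketch, assuming each phase is recorded with the time at which it completed; the `IncidentClock` class, its field names and the timestamps are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The four quadrants named in the article, in the order they are stepped through.
PHASES = ("detection", "determination", "resolution", "recovery")


@dataclass
class IncidentClock:
    """Hypothetical record of one outage: its start and the end of each phase."""
    outage_start: datetime
    phase_end: dict  # phase name -> datetime at which that phase completed

    def phase_durations(self) -> dict:
        """Return how long each quadrant took, measured from the previous one."""
        durations = {}
        previous = self.outage_start
        for phase in PHASES:
            end = self.phase_end[phase]
            if end < previous:
                raise ValueError(f"{phase} ends before the previous phase")
            durations[phase] = end - previous
            previous = end
        return durations

    def time_to_recovery(self) -> timedelta:
        """Total time on the clock, from loss of service to recovery."""
        return self.phase_end["recovery"] - self.outage_start


# Illustrative timestamps only, not data from any real incident.
start = datetime(2024, 1, 1, 0, 0)
clock = IncidentClock(
    outage_start=start,
    phase_end={
        "detection": start + timedelta(minutes=5),
        "determination": start + timedelta(minutes=20),
        "resolution": start + timedelta(minutes=50),
        "recovery": start + timedelta(minutes=60),
    },
)
durations = clock.phase_durations()
```

Breaking the total down by quadrant shows where the clock was actually spent, which is the information a team needs when it reviews its process afterwards.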
This approach is far more precise than the commonly used percentage of service availability, such as 99.999%. On Titanic, problem detection (within the MTTR clock) was the 37 seconds of warning given by the lookouts. That is not typical of an IT solution, which usually gives ample warning well before a major problem occurs. This gives operators, automated or human, time to prevent the problem from developing in the first place (see Part 8).
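To see why UOMs can say more than an availability percentage, consider two outages of equal length that affect very different numbers of users. The sketch below assumes a simple time-based availability figure and outages that do not overlap; the function names and the numbers are illustrative, not taken from the article.

```python
def user_outage_minutes(outages):
    """Sum users_affected * minutes over (users_affected, minutes) pairs."""
    return sum(users * minutes for users, minutes in outages)


def availability_percent(outages, period_minutes):
    """Time-based availability: share of the period with no outage in progress.

    Assumes the outages do not overlap in time.
    """
    downtime = sum(minutes for _, minutes in outages)
    return 100.0 * (period_minutes - downtime) / period_minutes


PERIOD = 30 * 24 * 60  # a 30-day reporting period, in minutes

# Illustrative figures: the same 30-minute outage at a quiet hour and at peak.
night_outage = [(20, 30)]     # 20 users affected for 30 minutes
peak_outage = [(5000, 30)]    # 5000 users affected for 30 minutes

avail_night = availability_percent(night_outage, PERIOD)
avail_peak = availability_percent(peak_outage, PERIOD)
uom_night = user_outage_minutes(night_outage)
uom_peak = user_outage_minutes(peak_outage)
```

Both outages produce the same availability percentage, yet `uom_peak` is 250 times `uom_night`, which is the difference in user impact that the percentage hides.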
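Next, Titanic's captain, director and officers gathered on the bridge to decide on a course of action. As part of problem determination, two parties were sent to the forward and midship sections to investigate the extent of the damage. The first returned within 10 minutes with a positive report: no major damage, no flooding. In the mind of director Bruce Ismay, detection and determination were now complete. Completing the next step, resolution, by sending a distress call was a real problem for him, however. It would bring a flood of rumor down on Titanic, damage White Star's market position, and destroy the brilliant marketing that had drawn the world's wealthy elite aboard the safest liner ever built.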
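In fact, a better resolution at this point would have been to take the ship to Halifax, Canada, away from New York, the center of the world's press. That way he could also craft a better news story and play the accident down as a minor incident. He could disembark the passengers onto trains, patch up the hull and sail the ship back to Belfast for major repairs. Indeed, he could even boast that Titanic, equipped with the latest emergency systems and a lifeboat in itself, had saved herself from the brink of a great disaster, and so further promote the safety of the White Star Line.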
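In today's IT solutions, problem determination has to assess the impact on users. The determination itself must be supported by evidence. Re-examining feedback mechanisms and logs is vital to establishing whether the problem has been escalating and what is causing it.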
In a large, complex IT solution, a domino effect is common: a small point of failure, such as a subsystem, affects its neighbors and sets off a large number of follow-on problems. Failing to work out the exact sequence of these failure events leads to misdiagnosis, a wrong fix, and a recurrence of the problem. Determination is formally complete only once the assumed root cause has been tested and confirmed.
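Working out the sequence of a cascade can be sketched as follows, assuming the logs yield timestamped failure events and that a map of which component depends on which is available. The function, the component names and the timestamps are hypothetical. The result is only a list of candidates: as noted above, a root cause assumption still has to be tested before determination is complete.

```python
def root_cause_candidates(depends_on, failures):
    """Return failed components that no earlier upstream failure explains.

    depends_on: component -> set of components it relies on (hypothetical map).
    failures:   list of (timestamp, component) failure events from the logs.
    A component is a candidate if none of its dependencies failed at or
    before its own first failure. Candidates are returned earliest first.
    """
    # Keep only the first failure seen for each component.
    first_failure = {}
    for ts, component in failures:
        if component not in first_failure or ts < first_failure[component]:
            first_failure[component] = ts

    candidates = []
    for component, ts in first_failure.items():
        upstream = depends_on.get(component, set())
        explained = any(
            u in first_failure and first_failure[u] <= ts for u in upstream
        )
        if not explained:
            candidates.append(component)
    return sorted(candidates, key=lambda c: first_failure[c])


# Illustrative topology and events (timestamps in seconds, arrival order mixed).
depends_on = {
    "web": {"app"},
    "app": {"db", "cache"},
    "db": {"storage"},
    "cache": set(),
    "storage": set(),
}
failures = [(105, "web"), (103, "app"), (101, "db"), (100, "storage"), (104, "app")]

candidates = root_cause_candidates(depends_on, failures)
```

Here the web tier is the most visible failure, but ordering the events against the dependency map points at `storage` as the first domino to investigate.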
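With an IT solution, it is important to be sure of the evidence and to ask the following questions. Did the solution know it was going to fail? If so, did any (automated) preventive actions take effect? Did those actions alert human or automated operators? Were the feedback mechanisms themselves faulty, or did they report unreliable data? Was problem detection carried out correctly?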
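Titanic was at a critical juncture but not yet in disaster. Ismay was preoccupied with saving face, and his anxiety for White Star's good name created an atmosphere in which mistakes came easily. Titanic sat on the underwater ice shelf, apparently quite unharmed. Had the ship been handled cautiously, with safety first, the damage might have turned out to be minor. Instead, Ismay rushed into a decision before the second damage-survey party, which included the architect and the carpenter, had returned with its assessment.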
The lesson for today's IT projects is that, in resolving a problem, it is important to examine each available course of action and weigh its risks on the basis of all the evidence. Only then should the final quadrant, recovery, begin, in which the operations team brings the IT solution back online and restores service in line with its SLAs.
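On Titanic, not all the available courses of action were adequately considered as part of problem resolution. Ismay made the wrong decision to keep the ship moving, and telegraphed the engine room "dead slow ahead" to complete the recovery. Engineers later testified that the ship went ahead at 3 knots, accompanied by a grinding noise.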
Conclusions
Today, many IT projects are severely compromised in the operations stage because of a gap in project planning: no problem-resolution process was planned around the MTTR clock. Such a process is critical in helping the operations team restore service quickly and maintain service levels. It should also use a series of reviews to provide checks and balances between groups, minimizing the likelihood of mistakes under pressure. Finally, it should set out roles and responsibilities to ensure that the right people make the right decisions.
The next part will look at how Titanic's officers reacted to the disaster.
Original text:
In recapping the famous ship’s situation, Titanic’s officers tried desperately to avoid a collision (see Part 8). However, the S-turn, a good decision, failed to decelerate the ship enough. Titanic almost innocuously came to a halt later described by hundreds of passengers as a quiver, rumble or grinding noise that lasted a few seconds as if the ship was rolling over a thousand marbles.
There was no "crash stop," fatalities or even minor injuries. There was no violent jolt sideways or repeated strikes along the ship’s length. This is common with a side swipe against an ice spur when a ship is turning very hard away from it. The breakfast cutlery that was laid out in the dining salons barely trembled, and drinks remained unspilled in the first-class smoking rooms and lounges. All the evidence indicates that the ship came to rest on an underwater ice shelf at the base of the iceberg. Murdoch had prevented a head-on crash that could have demolished the first four compartments, and killed and maimed hundreds of passengers.
Likewise, when an IT solution falters in production, steps are taken according to a process prepared, planned and tested in the project itself (see Part 4). The process should be based around a Mean Time To Recovery (MTTR) clock, where the principal objective is to get the IT solution back on-line as quickly as possible to meet Service Level Agreements (SLAs). The solution is then patched up in the background and a temporary or permanent fix applied.
However, before going on-line, the integrity of the solution needs to be first established so the problem does not reoccur. With an eye on the clock, the operations group steps through the process and the four "problem" quadrants of detection, determination, resolution and recovery. When the MTTR clock starts ticking, signifying the beginning of loss of service (an outage, see Part 2), metrics should be captured as User Outage Minutes (UOMs), which measure how many users experience service loss and for how long.
This is far more accurate than measuring with the more commonly used percentage of service availability, e.g., 99.999 percent. Problem detection on Titanic was 37 seconds of warning given by the lookouts. This is not typical with an IT solution, which is likely to put out errors and warnings well before any significant failure occurs. This provides operators, automated or human, time to prevent the problem from occurring in the first place (see Part 8).
Titanic’s captain, director and officers gathered on the bridge to determine a course of action. As part of problem determination, to establish the extent of the damage, two search parties were dispatched into the bowels of the ship, front and mid-ship. The first party returned within 10 minutes with a positive report of no major damage or flooding. In director Bruce Ismay’s mind, problem detection and determination were now complete. Resolution with a distress call was a problem for him as it would compromise White Star’s position by shattering the hype around Titanic and destroy the brilliant marketing (see Part 2 and Part 5) that had lured the world’s wealthy elite onto the safest liner ever built.
A better resolution would be to get the ship back to Halifax, away from New York and the center of the world’s press. He could then better contain the news story, and marginalize it as a minor incident. He would be able to disembark passengers onto trains, patch the ship up and sail her back to Belfast for repairs. In fact, he could boldly claim that Titanic, a lifeboat in itself with all the latest in emerging technologies, was able to save herself from a potential disaster and further push the safety claims of the White Star Line.
With an IT solution today, determination of the problem assesses the impact of the problem on users. Determination has to be consistent with the available evidence. Reinvestigation of feedback mechanisms and logs is vital to determine if the problem has been building up and what is causing it.
In a complex IT solution, it is common to see the domino effect, where a small faulty element like a subsystem knocks out elements around it and triggers a cascade of problems. Not working out this precise sequence of events could lead to a misdiagnosis where a wrong fix is applied and the problem reoccurs. Determination is completed when the root cause assumptions of the problem are tested and proven to be correct.
With an IT solution it is important to be sure of the evidence at hand and to ask the following questions. Was the IT solution aware it was going to fail? If so, were any (automated) preventative actions attempted? Did it alert human or automated operators? Were any of the feedback mechanisms faulty, providing unreliable data? Is the diagnosis of the problem correct?
Titanic’s situation was critical but not catastrophic. Ismay was hell-bent on saving face and his anxiety over White Star’s reputation created an atmosphere where mistakes were easily made. Titanic appeared to be completely stable, sitting snugly on the underwater ice shelf. Maybe with due care they could dislodge the ship with a minimum of damage. Ismay rushed into making a decision. The second search party with the architect and carpenter had not even returned with an assessment.
The lesson from this for IT projects today is that in resolving the problem it is important to consider the alternative courses of action available with the risk associated with each based on all the collected evidence. Only then should the last quadrant of recovery commence. This is where the operations group puts the IT solution back on-line and resumes services, according to SLAs.
On Titanic, not all courses of action were adequately explored as part of the problem resolution. Ismay made the fateful decision to sail forward and telegraphed the engine room "dead slow ahead" in recovering the situation. Engineers later testified the ship moved forward at 3 knots with a grinding noise.
Conclusions
Today, many IT projects severely compromise the operation stage by not planning adequately in the project for a process to deal with problems around an MTTR clock. A process is critical for enabling the operations group to quickly restore service and maintain service levels. A process should also carry the checks and balances (through reviews) to minimize the likelihood of mistakes made in a pressure situation. A process should outline responsibilities and roles to ensure the right personnel make the right decisions.
The next installment will look at how the officers reacted to the disastrous situation.