Abstract:

Provided are techniques for restoring firmware. A first programmable
hardware device determines that a second programmable hardware device
needs a valid firmware image, retrieves a copy of the valid firmware
image from an external memory, and sends the valid firmware image to the
second programmable hardware device via a private communication link,
wherein the private communication link enables private communication
between the first programmable hardware device and the second
programmable hardware device. The second programmable hardware device
restores existing firmware using the valid firmware image.

Claims:

1. A computer-implemented method for restoring firmware, comprising: under
control of a first programmable hardware device, determining that a second
programmable hardware device needs a valid firmware image; retrieving a
copy of the valid firmware image from an external memory; and sending the
valid firmware image to the second programmable hardware device via a
private communication link, wherein the private communication link
enables private communication between the first programmable hardware
device and the second programmable hardware device; and under control of
the second programmable hardware device, restoring existing firmware using
the valid firmware image.

2. The method of claim 1, wherein the external memory is dedicated to the
first programmable hardware device.

3. The method of claim 1, wherein the external memory is shared by the
first programmable hardware device and the second programmable hardware
device.

4. The method of claim 1, wherein the external memory is dedicated to the
second programmable hardware device and accessible by the first
programmable hardware device.

5. The method of claim 1, wherein the first programmable hardware device
and the second programmable hardware device are redundant and use a same
firmware.

6. The method of claim 1, wherein the first programmable hardware device
and the second programmable hardware device are not redundant and use
different firmware.

7. The method of claim 1, wherein determining that the second programmable
hardware device needs the valid firmware image further comprises: using a
heartbeat function, wherein when the second programmable hardware device
does not respond to a heartbeat function message from the first
programmable hardware device within a predetermined period of time, the
first programmable hardware device determines that the second
programmable hardware device needs the valid firmware image.

8. The method of claim 1, wherein determining that the second programmable
hardware device needs the valid firmware image further
comprises: receiving an indication from the second programmable hardware
device requesting the valid firmware image, wherein the second
programmable hardware device specifies a version of the valid firmware
image.

10. A computer-implemented method for restoring firmware, comprising: under
control of a first programmable hardware device, determining that a second
programmable hardware device needs a valid firmware image; retrieving a
copy of the valid firmware image from an external memory; and directly
updating an internal memory of the second programmable hardware device
with the valid firmware image by writing directly to a utility that
updates the internal memory; and under control of the second programmable
hardware device, restoring existing firmware using the valid firmware
image.

12. A computer program product comprising a computer useable medium
including a computer readable program, wherein the computer readable
program when executed causes: under control of a first programmable
hardware device, determining that a second programmable hardware device
needs a valid firmware image; retrieving a copy of the valid firmware
image from an external memory; and sending the valid firmware image to the
second programmable hardware device via a private communication link,
wherein the private communication link enables private communication
between the first programmable hardware device and the second
programmable hardware device, wherein existing firmware at the second
programmable hardware device is restored using the valid firmware image.

13. The computer program product of claim 12, wherein the external memory
is dedicated to the first programmable hardware device.

14. The computer program product of claim 12, wherein the external memory
is shared by the first programmable hardware device and the second
programmable hardware device.

15. The computer program product of claim 12, wherein the external memory
is dedicated to the second programmable hardware device and accessible by
the first programmable hardware device.

16. The computer program product of claim 12, wherein the first
programmable hardware device and the second programmable hardware device
are redundant and use a same firmware.

17. The computer program product of claim 12, wherein the first
programmable hardware device and the second programmable hardware device
are not redundant and use different firmware.

18. The computer program product of claim 12, wherein the computer
readable program when executed: uses a heartbeat function, wherein when
the second programmable hardware device does not respond to a heartbeat
function message from the first programmable hardware device within a
predetermined period of time, the first programmable hardware device
determines that the second programmable hardware device needs the valid
firmware image.

19. The computer program product of claim 12, wherein the computer
readable program when executed: receives an indication from the second
programmable hardware device requesting the valid firmware image, wherein
the second programmable hardware device specifies a version of the valid
firmware image.

20. The computer program product of claim 12, wherein the external memory
stores multiple versions of the valid firmware image.

21. A computer program product comprising a computer useable medium
including a computer readable program, wherein the computer readable
program when executed causes: under control of a first programmable
hardware device, determining that a second programmable hardware device
needs a valid firmware image; retrieving a copy of the valid firmware
image from an external memory; and directly updating an internal memory of
the second programmable hardware device with the valid firmware image by
writing directly to a utility that updates the internal memory, wherein
existing firmware at the second programmable hardware device is restored
using the valid firmware image.

22. The computer program product of claim 21, wherein the utility
comprises a Joint Test Action Group (JTAG) interface.

Description:

RELATED APPLICATIONS

[0001]This application is related to the following commonly assigned and
co-pending U.S. patent applications:

[0002]Application Ser. No. ______, filed on the same date herewith,
entitled "AUTOMATED FIRMWARE RESTORATION TO A PEER PROGRAMMABLE HARDWARE
DEVICE", by Earle Ellsworth et al., with Docket No. TUC920060184US2, and
which is incorporated herein by reference in its entirety; and

[0003]Application Ser. No. 11/304,407, filed on Dec. 14, 2005, entitled
"SIMULTANEOUS DOWNLOAD TO MULTIPLE TARGETS", with Docket No.
BEA920050029US1, and which is incorporated herein by reference in its
entirety.

BACKGROUND

[0004]1. Field

[0005]Embodiments of the invention relate to automated firmware
restoration to a peer programmable hardware device.

[0006]2. Description of the Related Art

[0007]Programmable hardware devices (e.g., a Small Computer System
Interface (SCSI) Enclosure Services (SES) processor in a storage server
or a Universal Serial Bus (USB) controller for a USB device) are found in
many different types of systems. In some cases, the purpose of the
programmable hardware device is to provide reliability, availability, or
serviceability (RAS) features. However, occasionally a programmable
hardware device may require an update to the firmware that is driving its
operation. Firmware may be described as programming that is a permanent
part of a device (e.g., by being inserted into Programmable Read-Only
Memory (PROM)). Also, firmware may be described as programming that is
running on the programmable hardware device, whereas a firmware image may
be described as the set of data that comprises the firmware that gets
loaded onto the programmable hardware device. In many cases, the firmware
that is written to the programmable hardware device will overwrite the
previously operating firmware. Thus, if corrupt firmware (i.e., in the
form of a firmware image) is written to the programmable hardware device,
the programmable hardware device will not operate and, thus, can no
longer provide normal functionality. A programmable hardware device with
corrupt firmware (i.e., with a corrupt firmware image) may be referred to
as a corrupted programmable hardware device. Firmware that is corrupted
may be described as corrupted firmware or invalid firmware.

[0008]An alternative condition that occasionally may occur is that the
firmware runs into an error during normal operation (e.g., when a
firmware image download is not occurring or at runtime), and the error
corrupts the firmware image, thus, also preventing the programmable
hardware device from providing normal functionality.

[0009]Typically, system devices that fail in some way negatively affect
the overall performance of the system, which in most customer
environments is not acceptable. Typically, the conventional means to fix
the problem is to replace the programmable hardware device or, if
possible, reinstall the firmware image. However, these fixes require
intervention from some type of external support, and this type of
intervention is not automatic. If a customer had critical operations that
were negatively affected by any delays, the existing solution of calling
and waiting for support would not be adequate.

[0010]Thus, there is a need in the art to enable automatic self correction
of a firmware problem for a programmable hardware device.

SUMMARY OF EMBODIMENTS OF THE INVENTION

[0011]Provided are a method and computer program product for restoring
firmware. A first programmable hardware device determines that a second
programmable hardware device needs a valid firmware image, retrieves a
copy of the valid firmware image from an external memory, and sends the
valid firmware image to the second programmable hardware device via a
private communication link, wherein the private communication link
enables private communication between the first programmable hardware
device and the second programmable hardware device. The second
programmable hardware device restores existing firmware using the valid
firmware image.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012]Referring now to the drawings in which like reference numbers
represent corresponding parts throughout:

[0013]FIG. 1 illustrates details of two redundant devices in accordance
with certain embodiments of the present invention.

[0014]FIG. 2 illustrates details of two programmable hardware devices that
are not redundant in accordance with certain embodiments of the present
invention.

[0015]FIG. 3 illustrates logic performed by each programmable hardware
device in response to receiving a firmware image in accordance with
certain embodiments.

[0016]FIG. 4 illustrates logic performed by a programmable hardware device
that fails in accordance with certain embodiments.

[0017]FIG. 5 illustrates logic performed by a programmable hardware device
when a partner fails in accordance with certain embodiments.

[0018]FIG. 6 illustrates a system architecture that may be used in
accordance with certain embodiments.

DETAILED DESCRIPTION

[0019]In the following description, reference is made to the accompanying
drawings which form a part hereof and which illustrate several
embodiments of the invention. It is understood that other embodiments may
be utilized and structural and operational changes may be made without
departing from the scope of the invention.

[0020]Embodiments provide an automated firmware restoration to a
programmable hardware device that either has a redundant partner, has a
shared external memory with another processor, or is linked to another
processor that maintains a copy of a valid firmware image for the
programmable hardware device. In certain embodiments, there are two
redundant devices and any firmware update on one of the devices may be
corrected by the other device, provided the failing device is capable of
receiving updates and a communication interface is functioning between
the two devices. Firmware update may be described as update of the
firmware image. In certain embodiments, there are two processors that are
not redundant, but the corrupted processor has a functioning
communication interface to the other processor, and this other processor
either has access to an external memory of the corrupted processor that
contains a copy of the firmware image of the corrupted processor or
maintains a copy of the firmware image of the corrupted processor in some
storage (e.g., the firmware image is sent to the processor during a
firmware update for storage purposes). Embodiments are also applicable to
a condition separate from a firmware update where the processor
encounters some type of error that corrupts the firmware image during
normal operation and the processor detects the failure either during
normal operation or upon reboot.

[0021]Merely to enhance understanding, an explanation of embodiments
applicable to firmware update is provided. However, embodiments are also
applicable to conditions in which a firmware image is corrupted in normal
operation.

[0022]FIG. 1 illustrates details of two redundant devices in accordance
with certain embodiments of the present invention. In FIG. 1, a
Management Module (MM) 100 is an initiating device and is connected via
an external device communication medium 150 to a server 110. The server
110 includes active Baseboard Management Controller (BMC) 120 and standby
Baseboard Management Controller (BMC) 130. The active and standby BMCs
120, 130 provide management capabilities to local resources and redundant
management capabilities to shared resources.

[0024]In FIG. 1, the active and standby BMCs 120, 130 are programmable
hardware devices that are redundant. The active BMC 120 and standby BMC
130 may also be referred to as first and second target devices,
respectively, or as partner BMCs. The active and standby BMCs 120, 130
may be described as dual BMCs for redundancy purposes to improve the
overall server 110 reliability; however, there are local components that
each BMC 120, 130 individually controls that are not protected by
redundancy (e.g., one BMC 120, 130 is not able to power components
directly controlled by the partner BMC 120, 130). The BMCs 120, 130 have
an internal private communication link 142 between them along with an
external memory 128, 138 dedicated to each BMC 120, 130. Using these
components 142, 128, 138, embodiments enable automatic firmware
correction.

[0025]The active BMC 120 and the standby BMC 130 are able to communicate
over the private communication link 142. The private communication link
142 may be described as a private device communication medium that
enables private communication between the active and standby BMCs 120,
130. The management module 100 is not able to send communications
directly on the private communication link 142. An internal device
communication medium 140 is coupled to the external device communication
medium 150. In certain embodiments, the external device communication
medium 150 may be a bus (e.g., an RS485 serial bus, an inter-integrated
circuit (I2C) bus, a Dual Port RAM (DPRAM), or other bus-based media),
and the internal device communication medium 140 may be a bus (e.g., an
internal RS485 serial bus, inter-integrated circuit (I2C) bus, a Dual
Port RAM (DPRAM), or other bus-based media) connected to the external
device communication medium 150.

[0026]In normal operation, the management module 100 sends a single
firmware image (through multiple packets) to both of the active and
standby BMCs 120, 130 by issuing the firmware image over a single
external device communication medium 150. Each BMC 120, 130 in the normal
update process writes the firmware image directly to an internal memory
124, 134 (e.g., an internal flash area) that stores the BMC instruction
set that drives the processor 122, 132. In the event that the firmware
image is corrupt, each BMC 120, 130 has the limited capability to wait
for a new firmware update through a set of reduced functionality provided
in code that is not updated in the firmware update process (i.e., code
that is stored in what is referred to as a boot block). However, normal
functions of the BMC 120, 130 can no longer be provided when the firmware
image is corrupt. In server 110, this may be a problem because each BMC
120, 130 controls system power in the server 110. So, if the server 110
is powered down for a firmware update and the update fails (i.e., the
firmware image is corrupt), the server 110 is not able to power on until
a valid firmware update is provided, meaning a customer would lose
usability of the blade server. A "valid" firmware update may be described
as one that is not corrupted and enables the BMC 120, 130 to start up
and perform its functionality correctly. An example result of corrupted
firmware is a situation in which the firmware update completes
successfully, and the BMC 120, 130 starts up as normal, but then some of
its normal functionality is inhibited by the new, corrupted firmware.
Even though there is a second BMC 120, 130, in the event of a firmware
failure, the second BMC 120, 130 is not able to power components directly
controlled by the partner BMC 120, 130, which underscores the importance
of both BMCs 120, 130 being in an operational state.

[0027]With embodiments, the active BMC 120 is coupled to an external
memory 128 in which the active BMC 120 stores a copy of the firmware
image. The standby BMC 130 is coupled to an external memory 138 in which
the standby BMC 130 stores a copy of the firmware image. In the case of
redundant devices, each redundant device runs the same firmware.
Therefore, the copies of the firmware images in the external memories
128, 138 are the same. With these copies, if one of the BMCs 120, 130 has
firmware that is corrupt, the other BMC 120, 130 is able to provide a
corresponding firmware image from the external memory 128, 138.

[0028]In certain embodiments, the server 110 is a blade server in an
IBM® BladeCenter® chassis (available from International Business
Machines Corporation), where the blade server has dual baseboard
management controllers (BMCs). The blade server may be described as a
midrange server class storage system. However, embodiments are applicable
to any set of target devices (e.g., redundant devices, such as the active
and standby BMCs 120, 130), may use any shared communication medium
(e.g., internal device communication medium 140) between the target
devices that permits "snooping" in data sniffing mode, and may use any
private device communication medium (e.g., private communication link
142) between the redundant target devices. Data sniffing mode (also
referred to as "promiscuous" mode) may be described as a mode in which a
target device intercepts and reads each communication (e.g., network
packet) that arrives in its entirety, whether or not the communication is
addressed to that target device. Embodiments may be used in networks that
are serial or non-serial. Although examples herein may refer to firmware
update, embodiments are applicable to software updates. Also, there may
be any number of devices that receive the update over a same device
communication medium. The active and standby BMCs 120, 130 may be
described as control entities in the blade server used by an IBM®
BladeCenter® Management Module (MM).

[0029]During normal operation, one BMC (e.g., the active BMC 120) is said
to "own" the external device communication medium 150, and so one BMC is
capable of communicating with the management module 100 at a time.
Although the management module 100 is aware of sending commands to the
active BMC 120, the management module 100 does not speak directly to the
active BMC 120. Instead, the management module 100 sends messages to an
address on the external device communication medium 150 that is
associated with the server 110 slot, and the active and standby BMCs 120,
130 are capable of responding to and/or listening on this address.
Therefore, from the management module 100 perspective, the management
module is speaking to one BMC at any moment in time. In certain
embodiments of the dual BMCs 120, 130 in a server 110, there is no
hardware inhibitor that prevents both BMCs from actively using the
external device communication medium 150 (where the standby BMC 130 may
access the external device communication medium 150 via the internal
device communication medium 140). In certain embodiments, however, the
external device communication medium 150 is actively used by a default
BMC that is defined as the active BMC 120, and the other BMC 130 remains
in an inactive state with the internal device communication medium 140
until the active BMC 120 that is actively using the external device
communication medium 150 fails.

[0030]FIG. 2 illustrates details of two programmable hardware devices that
are not redundant in accordance with certain embodiments of the present
invention. In FIG. 2, a Management Module (MM) 200 is an initiating
device and is connected via an external device communication medium 250
to a server 210. The server 210 includes programmable hardware device A
220 and programmable hardware device B 230. The programmable hardware
device A 220 includes processor 222 and internal memory 224. The
programmable hardware device B includes processor 232 and internal memory
234.

[0031]In FIG. 2, the programmable hardware devices 220, 230 may also be
referred to as partners. The programmable hardware devices 220, 230 have
an internal private communication link 242 between them, and each
programmable hardware device 220, 230 has an external memory 228, 238.
Dashed line 290 indicates that programmable hardware device A 220 is
optionally coupled to external memory 238, while dashed line 292
indicates that programmable hardware device B 230 is optionally coupled
to external memory 228. In this manner either external memory 228, 238 or
both external memories 228, 238 may function as a shared external memory.
In certain embodiments, the external memory 228, 238 is dedicated to one
programmable hardware device 220, 230 and accessible by the other
programmable hardware device 220, 230. With the embodiments illustrated
in FIG. 2, the programmable hardware devices 220, 230 do not use the same
firmware, but each is able to access a copy of the other programmable
hardware device 220, 230 firmware image (e.g., either from its own
external memory 228, 238 or from the other programmable hardware device's
external memory 228, 238). Then, if one programmable hardware device 220,
230 fails due to firmware being corrupt, the other programmable
hardware device 220, 230 is able to provide a valid firmware image.

[0032]Although embodiments refer to an external memory 128, 138, 228, 238,
any storage space external to the programmable hardware device 120, 130,
220, 230 may be used as long as the storage space is non-volatile or in
some way protects the contents such that the firmware in the memory is
not lost (e.g., during a power off phase).

[0034]With embodiments, each programmable hardware device (e.g., BMC 120,
130 or devices 220, 230) makes use of the new hardware component
connected to that programmable hardware device, the external memory. FIG.
3 illustrates logic performed by each programmable hardware device 120,
130, 220, 230 in response to receiving a firmware image in accordance
with certain embodiments. Control begins at block 300 with a programmable
hardware device 120, 130, 220, 230 receiving a firmware image (i.e., a
new firmware image that may either be the first firmware image received
or that may be an update to a previously received firmware image). In block
302, the programmable hardware device 120, 130, 220, 230 stores the
firmware image in external memory. Each external memory 128, 138, 228,
238 is large enough to hold multiple copies of a firmware image (e.g.,
multiple versions, which enables a programmable hardware device 120, 130,
220, 230 to obtain a particular version of the firmware image). In
certain embodiments, the number of copies is determined by how many
copies the memory supports, while in other embodiments, other factors may
be used in addition to or instead of memory size.
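
The storage step of FIG. 3 can be pictured with a small model. The following is a minimal sketch only, assuming a hypothetical ExternalMemory class with an assumed capacity limit (max_copies) and an assumed oldest-first eviction policy; none of these names or policies come from the embodiments themselves.

```python
from collections import OrderedDict


class ExternalMemory:
    """Hypothetical model of an external, non-volatile firmware store."""

    def __init__(self, max_copies=2):
        # max_copies is an assumed capacity limit, not a value from the text.
        self.max_copies = max_copies
        self._images = OrderedDict()  # version -> image bytes, oldest first

    def store(self, version, image):
        # Block 302: keep a copy of each firmware image that is received.
        self._images.pop(version, None)
        self._images[version] = bytes(image)
        while len(self._images) > self.max_copies:
            self._images.popitem(last=False)  # assumed policy: evict oldest

    def get(self, version=None):
        # Return a specific version, or the most recently stored one.
        if not self._images:
            return None
        if version is None:
            return self._images[next(reversed(self._images))]
        return self._images.get(version)

    def versions(self):
        # Versions currently held, oldest first.
        return list(self._images)
```

Keeping the images keyed by version is one way to let a device later ask for a particular version rather than only the newest copy.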

[0035]In particular, during normal firmware operations, each BMC 120, 130
copies the firmware image to an area in the external memory 128, 138.
Each BMC 120, 130 keeps two or more copies. For non-redundant
programmable hardware devices 220, 230 that do share one or more external
memories 228, 238, each programmable hardware device 220, 230 may store a
copy of its firmware image in the shared external memory 228, 238. On the
other hand, for non-redundant programmable hardware devices 220, 230 that
do not share one or more external memories 228, 238, each programmable
hardware device 220, 230 receives a copy of the firmware image of the
other programmable hardware device 220, 230 and stores that copy in its
own external memory 228, 238.

[0036]FIG. 4 illustrates logic performed by a programmable hardware device
120, 130, 220, 230 that fails in accordance with certain embodiments.
Control begins at block 400 with the programmable hardware device 120,
130, 220, 230 optionally determining that a valid firmware image is
needed and notifying a partner. In certain embodiments, rather than the
programmable hardware device 120, 130, 220, 230 notifying the partner,
the partner automatically detects the failure and sends a valid firmware
image. In block 402, the programmable hardware device 120, 130, 220, 230
receives a copy of the valid firmware image. In certain embodiments, the
programmable hardware device 120, 130, 220, 230 receives the copy from
the partner. In certain alternative embodiments, the programmable
hardware device 120, 130, 220, 230 obtains a copy of the valid firmware
image from its external memory 128, 138, 228, 238. In block 404, the
programmable hardware device 120, 130, 220, 230 is restored using the
received copy of the valid firmware image. In block 406, the programmable
hardware device 120, 130, 220, 230 optionally stores a copy of the valid
firmware image in its external memory 128, 138, 228, 238 (e.g., if the
firmware image was not retrieved from its own external memory 128, 138,
228, 238).
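
The logic of FIG. 4 can be sketched from the failing device's point of view. This is an illustrative sketch under stated assumptions: the device is modeled as a plain dictionary with hypothetical keys 'external' and 'internal', request_from_partner is a hypothetical callable, and the ordering (own external memory first, partner second) is only one of the alternatives described.

```python
def restore_failed_device(device, request_from_partner):
    """Restore a failed device from a valid firmware image.

    device: dict with hypothetical keys
        'external': list of image copies in external memory, newest last
        'internal': the image currently held in internal memory
    request_from_partner: hypothetical callable returning bytes or None
    """
    # Blocks 400/402: obtain a valid image, from the device's own external
    # memory if one is held there, otherwise from the partner.
    from_own_memory = bool(device["external"])
    if from_own_memory:
        image = device["external"][-1]
    else:
        image = request_from_partner()
    if image is None:
        return False  # no valid image available yet
    # Restore step: overwrite the internal memory with the valid image.
    device["internal"] = image
    # Block 406: optionally keep a copy if the image came from the partner.
    if not from_own_memory:
        device["external"].append(image)
    return True
```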

[0037]FIG. 5 illustrates logic performed by a programmable hardware device
120, 130, 220, 230 when a partner fails in accordance with certain
embodiments. Control begins at block 500 with the programmable hardware
device 120, 130, 220, 230 determining that the partner needs a valid
firmware image. The determination may be made by the programmable
hardware device 120, 130, 220, 230 automatically or the programmable
hardware device 120, 130, 220, 230 may receive an indication of failure
from the partner. In block 502, the programmable hardware device 120,
130, 220, 230 retrieves a copy of the valid firmware image from the
external memory 128, 138, 228, 238. In various embodiments, the copy of
the valid firmware image may be retrieved from an external memory
dedicated to the programmable hardware device 120, 130, 220, 230 or from
a shared external memory. In block 504, the programmable hardware device
120, 130, 220, 230 sends a copy of the valid firmware image to the
partner via the private communication link 142, 242.
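
The partner side of FIG. 5 can be sketched in the same spirit. The function below is a hypothetical illustration: serve_partner, send_on_private_link, and chunk_size are assumed names, and splitting the image into fixed-size chunks on the private link is an assumption rather than something the embodiments require.

```python
def serve_partner(external_images, send_on_private_link,
                  requested_version=None, chunk_size=4):
    """Retrieve a valid image and send it to the partner.

    external_images: dict mapping version -> image bytes, oldest first
    send_on_private_link: hypothetical callable that transmits one chunk
    requested_version: optional version named by the partner
    Returns the version that was sent, or None if nothing could be sent.
    """
    # Block 502: retrieve a copy of the valid image from external memory.
    if not external_images:
        return None
    if requested_version is None:
        version = list(external_images)[-1]  # assumed default: newest copy
    elif requested_version in external_images:
        version = requested_version
    else:
        return None
    image = external_images[version]
    # Block 504: send the image to the partner over the private link.
    for offset in range(0, len(image), chunk_size):
        send_on_private_link(image[offset:offset + chunk_size])
    return version
```

Passing the transmit function in as a parameter keeps the sketch independent of whatever medium actually implements the private communication link.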

[0038]With reference to FIG. 1 and embodiments in which there are two
redundant devices, in the event of a corrupt firmware image being passed
from the management module 100 to the BMC 120, 130, since one BMC 120,
130 is updated at a time, the firmware update process may fail without
actions performed in accordance with embodiments. In certain embodiments,
the corrupted BMC 120, 130 directly accesses its external memory 128, 138
for a copy of a firmware image to update itself. In certain embodiments,
if the BMC 120, 130 does not find a valid firmware image in the external
memory 128, 138 or if the corrupted BMC 120, 130 is designed to provide
an indication to a partner, the corrupted BMC 120, 130 provides an
indication to the partner BMC 120, 130 that a valid firmware image is
needed. The partner BMC 120, 130 is in a position to roll back the
firmware of the corrupted BMC 120, 130 by providing the corrupted BMC
120, 130 with the last valid firmware image in its external memory 128,
138 or by directly reading its own firmware image and providing that to
the other BMC 120, 130 via the private communication link 142. The
operating BMC 120, 130 acts as the management module that initiates the
firmware update process and updates the partner BMC 120, 130 that was
corrupted. Control to initiate a firmware update of a BMC 120, 130 may be
placed in the valid partner's domain with a heartbeat mechanism. A
heartbeat mechanism may be described as one in which a programmable
hardware device 120, 130, 220, 230 periodically sends a message to a
partner and receives a message from the partner to determine whether the
partner is still functioning. For example, when a programmable hardware
device 120, 130, 220, 230 does not respond to the heartbeat function
message within a predetermined period of time, the partner programmable
hardware device 120, 130, 220, 230 determines that the programmable
hardware device 120, 130, 220, 230 has failed and needs a valid firmware
image. Thus, upon completion, both BMCs are able to operate again in a
redundant fashion.
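
The heartbeat check described above reduces to comparing the time since the partner's last reply against a predetermined period. The following is a minimal sketch, assuming a hypothetical HeartbeatMonitor class and timestamps supplied by the caller in seconds; the actual message format and timing source are not specified by the embodiments.

```python
class HeartbeatMonitor:
    """Tracks replies to heartbeat messages sent to a partner device."""

    def __init__(self, timeout_s, now=0.0):
        # timeout_s stands in for the "predetermined period of time".
        self.timeout_s = timeout_s
        self.last_reply = now

    def record_reply(self, now):
        # Called whenever the partner answers a heartbeat message.
        self.last_reply = now

    def partner_needs_image(self, now):
        # The partner is treated as failed once it has been silent for
        # longer than the predetermined period.
        return (now - self.last_reply) > self.timeout_s
```

Taking the current time as an argument, rather than reading a clock inside the class, makes the timeout behavior easy to exercise deterministically.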

[0039]With reference to FIG. 2 and embodiments in which the devices are
not redundant, a shared external memory 228, 238 may be available between
processors 222, 232 or both processors 222, 232 may maintain the
partner's firmware image in separate dedicated external memories 228,
238. The processors 222, 232 do not have to be redundant but have a
functioning communication interface (e.g., private communication link
242). In various embodiments, either each processor 222, 232 implements a
heartbeat function during normal operation or the corrupted processor
222, 232 indicates a failure to the partner processor 222, 232. For
firmware updates, the heartbeat mechanism stops temporarily, since the
processor 222, 232 being updated is not in a position to handle that
functionality (i.e., because, typically, the functionality is in
operational code that cannot be accessed during a firmware update).
However, a timeout may be implemented that expires if the heartbeat
mechanism was not started in time (i.e., the timeout may be defined as
the longest time allowable for a firmware update for the partner
processor). In the event of a timeout, the partner processor 222, 232
automatically retrieves a firmware image from external memory 228, 238 to
update the corrupted processor 222, 232. In certain embodiments, the
corrupted processor 222, 232 requests and receives a firmware image of
the last version of firmware that is usable from the partner processor
222, 232. The firmware image may be provided over the private
communication link 242 between the processors 222, 232 or, alternatively,
the corrupted processor's internal memory 224, 234 (e.g., an internal
flash area) may be directly updated if the partner processor 222, 232 has
the ability to directly update the internal memory 224, 234 by writing
directly to a utility that updates a processor's internal memory 224, 234
such as a Joint Test Action Group (JTAG) interface. Upon completion, the
corrupted processor has a valid firmware image.
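
The heartbeat and update-timeout behavior described above may be sketched
as follows. This is a minimal illustration only, not the implementation of
any embodiment. The names PartnerMonitor, restore_partner,
read_external_memory, and send_over_link are hypothetical, as are the
timeout values; time is passed in explicitly so the logic can be exercised
without real hardware.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PartnerMonitor:
    """Partner heartbeat state as seen by the healthy processor (hypothetical)."""

    heartbeat_timeout: float  # longest silence tolerated in normal operation
    update_timeout: float     # longest time allowable for a firmware update
    last_beat: float = 0.0
    update_started: Optional[float] = None

    def on_heartbeat(self, now: float) -> None:
        # A heartbeat means the partner is running operational code again,
        # so any pending firmware update is considered finished.
        self.last_beat = now
        self.update_started = None

    def on_update_start(self, now: float) -> None:
        # Heartbeats stop during an update; switch to the update timeout.
        self.update_started = now

    def needs_restore(self, now: float) -> bool:
        if self.update_started is not None:
            return now - self.update_started > self.update_timeout
        return now - self.last_beat > self.heartbeat_timeout


def restore_partner(
    monitor: PartnerMonitor,
    now: float,
    read_external_memory: Callable[[], bytes],
    send_over_link: Callable[[bytes], None],
) -> bool:
    """On timeout, fetch the valid image and send it over the private link."""
    if not monitor.needs_restore(now):
        return False
    image = read_external_memory()
    send_over_link(image)
    return True
```

In this sketch, a caller would invoke restore_partner periodically with
the current time. The two callables stand in for access to external memory
228, 238 and for the private communication link 242; a direct write to the
partner's internal memory 224, 234 could be substituted for send_over_link.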

[0040]Using these embodiments, a programmable hardware device 120, 130,
220, 230 that receives a corrupt firmware image is automatically restored
with a valid firmware image. With embodiments, when the programmable
hardware device 120, 130, 220, 230 encounters an error during normal
operation that corrupts its firmware or firmware image, the programmable
hardware device 120, 130, 220, 230 is able to be restored using a valid
firmware image from its own external memory 128, 138, 228, 238 or from a
partner 120, 130, 220, 230. An example of a condition that could corrupt
a firmware image during normal operation is an invalid memory access that
causes the firmware to execute invalid code or access garbage data. In
this case, if the programmable hardware device 120, 130, 220, 230 has the
capability to identify that it is in a corrupt state during normal
operation, the programmable hardware device 120, 130, 220, 230 is able to
be restored with a valid firmware image. Also, if the programmable
hardware device 120, 130, 220, 230 reboots because of a timeout (e.g., a
timeout of a watchdog timer), and, during its initialization, the
programmable hardware device 120, 130, 220, 230 identifies that there was
a problem, the programmable hardware device 120, 130, 220, 230 is able to
halt normal boot up to obtain a valid firmware image. A watchdog timer
may be described as a timer that is to be periodically reset by hardware,
and, if the timer is not reset, the system enters a failure state.
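
The watchdog timer and the boot-time check described above may be
illustrated with the following sketch. It is an illustration under stated
assumptions rather than the mechanism of any embodiment: the description
does not say how the device identifies a problem during initialization, so
a CRC-32 comparison is assumed here, and the names WatchdogTimer, boot,
and fetch_valid_image are hypothetical.

```python
import zlib
from typing import Callable


class WatchdogTimer:
    """Timer that must be reset periodically; otherwise it expires."""

    def __init__(self, period: float) -> None:
        self.period = period
        self.deadline = period
        self.expired = False

    def kick(self, now: float) -> None:
        # Resetting the timer pushes the deadline out by one period.
        # Once expired, the failure state is latched and kicks are ignored.
        if not self.expired:
            self.deadline = now + self.period

    def poll(self, now: float) -> bool:
        # Returns True once the deadline has passed without a reset.
        if now >= self.deadline:
            self.expired = True
        return self.expired


def boot(
    reset_by_watchdog: bool,
    image: bytes,
    expected_crc: int,
    fetch_valid_image: Callable[[], bytes],
) -> bytes:
    """Return the image to run, halting normal boot if a problem is found.

    Assumption: a problem is identified when the reboot was caused by the
    watchdog and the stored image fails a CRC-32 check.
    """
    if reset_by_watchdog and zlib.crc32(image) != expected_crc:
        # Halt normal boot up and obtain a valid firmware image instead.
        return fetch_valid_image()
    return image
```

Here fetch_valid_image stands in for either source named above: the
device's own external memory 128, 138, 228, 238 or a partner 120, 130,
220, 230.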

[0042]The corrupt firmware image may be restored during firmware update or
during normal operation. Embodiments enable rolling back to a valid
firmware image. Embodiments also provide a heartbeat mechanism to detect
failure of a partner 120, 130, 220, 230 during normal operation.

ADDITIONAL EMBODIMENT DETAILS

[0043]The described operations may be implemented as a method, computer
program product or apparatus using standard programming and/or
engineering techniques to produce software, firmware, hardware, or any
combination thereof.

[0044]Each of the embodiments may take the form of an entirely hardware
embodiment, an entirely software embodiment or an embodiment containing
both hardware and software elements. The embodiments may be implemented
in software, which includes but is not limited to firmware, resident
software, microcode, etc.

[0045]Furthermore, the embodiments may take the form of a computer program
product accessible from a computer-usable or computer-readable medium
providing program code for use by or in connection with a computer or any
instruction execution system. For the purposes of this description, a
computer-usable or computer readable medium may be any apparatus that may
contain, store, communicate, propagate, or transport the program for use
by or in connection with the instruction execution system, apparatus, or
device.

[0046]The described operations may be implemented as code maintained in a
computer-usable or computer readable medium, where a processor may read
and execute the code from the computer readable medium. The medium may be
an electronic, magnetic, optical, electromagnetic, infrared, or
semiconductor system (or apparatus or device) or a propagation medium.
Examples of a computer-readable medium include a semiconductor or solid
state memory, magnetic tape, a removable computer diskette, a rigid
magnetic disk, an optical disk, magnetic storage medium (e.g., hard disk
drives, floppy disks, tape, etc.), volatile and non-volatile memory
devices (e.g., a random access memory (RAM), DRAMs, SRAMs, a read-only
memory (ROM), PROMs, EEPROMs, Flash Memory, firmware, programmable logic,
etc.). Current examples of optical disks include compact disk-read only
memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

[0047]The code implementing the described operations may further be
implemented in hardware logic (e.g., an integrated circuit chip,
Programmable Gate Array (PGA), Application Specific Integrated Circuit
(ASIC), etc.). Still further, the code implementing the described
operations may be implemented in "transmission signals", where
transmission signals may propagate through space or through a
transmission medium, such as an optical fiber, copper wire, etc. The
transmission signals in which the code or logic is encoded may further
comprise a wireless signal, satellite transmission, radio waves, infrared
signals, Bluetooth, etc. The transmission signals in which the code or
logic is encoded are capable of being transmitted by a transmitting
station and received by a receiving station, where the code or logic
encoded in the transmission signal may be decoded and stored in hardware
or a computer readable medium at the receiving and transmitting stations
or devices.

[0048]A computer program product may comprise computer-usable or
computer-readable media, hardware logic, and/or transmission signals in
which code
may be implemented. Of course, those skilled in the art will recognize
that many modifications may be made to this configuration without
departing from the scope of the embodiments, and that the computer
program product may comprise any suitable information bearing medium
known in the art.

[0049]The term logic may include, by way of example, software, hardware,
firmware, and/or combinations of software and hardware.

[0050]Certain implementations may be directed to a method for deploying
computing infrastructure by a person or automated processing integrating
computer-readable code into a computing system, wherein the code in
combination with the computing system is enabled to perform the
operations of the described implementations.

[0051]The logic of FIGS. 3, 4 and 5 describes specific operations
occurring in a particular order. In alternative embodiments, certain of
the logic operations may be performed in a different order, modified or
removed. Moreover, operations may be added to the above described logic
and still conform to the described embodiments. Further, operations
described herein may occur sequentially or certain operations may be
processed in parallel, or operations described as performed by a single
process may be performed by distributed processes.

[0052]The illustrated logic of FIGS. 3, 4, and 5 may be implemented in
software, hardware, programmable and non-programmable gate array logic or
in some combination of hardware, software, or gate array logic.

[0053]FIG. 6 illustrates a system architecture 600 that may be used in
accordance with certain embodiments. Client computer 100 and/or server
computer 120 may implement system architecture 600. The system
architecture 600 is suitable for storing and/or executing program code
and includes at least one processor 602 coupled directly or indirectly to
memory elements 604 through a system bus 620. The memory elements 604 may
include local memory employed during actual execution of the program
code, bulk storage, and cache memories which provide temporary storage of
at least some program code in order to reduce the number of times code
must be retrieved from bulk storage during execution. The memory elements
604 include an operating system 605 and one or more computer programs
606. The memory elements 604 may also include code 630 that implements
some or all of the described operations taught by embodiments of the
invention. Although code 630 is shown, the described operations taught by
embodiments of the invention may alternatively be implemented in hardware
or in a combination of hardware and software.

[0054]Input/Output (I/O) devices 612, 614 (including but not limited to
keyboards, displays, pointing devices, etc.) may be coupled to the system
either directly or through intervening I/O controllers 610.

[0055]Network adapters 608 may also be coupled to the system to enable the
data processing system to become coupled to other data processing systems
or remote printers or storage devices through intervening private or
public networks. Modems, cable modems, and Ethernet cards are just a few of
the currently available types of network adapters 608.

[0056]The system architecture 600 may be coupled to storage 616 (e.g., a
non-volatile storage area, such as magnetic disk drives, optical disk
drives, a tape drive, etc.). The storage 616 may comprise an internal
storage device or an attached or network accessible storage. Computer
programs 606 in storage 616 may be loaded into the memory elements 604
and executed by a processor 602 in a manner known in the art.

[0057]The system architecture 600 may include fewer components than
illustrated, additional components not illustrated herein, or some
combination of the components illustrated and additional components. The
system architecture 600 may comprise any computing device known in the
art, such as a mainframe, server, personal computer, workstation, laptop,
handheld computer, telephony device, network appliance, virtualization
device, storage controller, etc.

[0058]The foregoing description of embodiments of the invention has been
presented for the purposes of illustration and description. It is not
intended to be exhaustive or to limit the embodiments to the precise form
disclosed. Many modifications and variations are possible in light of the
above teaching. It is intended that the scope of the embodiments be
limited not by this detailed description, but rather by the claims
appended hereto. The above specification, examples and data provide a
complete description of the manufacture and use of the composition of the
embodiments. Since many embodiments may be made without departing from
the spirit and scope of the embodiments, the embodiments reside in the
claims hereinafter appended or any subsequently-filed claims, and their
equivalents.