Introduction

The goal of this project is to quantify the reliability of the Blue Waters memory subsystem: whether it fails, how frequently it fails, and how it fails. Syslog data from the system (in particular, entries pertaining to Machine Check Exceptions) has been chunked into two-month periods, pre-processed (parsed and cleaned up), and tabulated for use in this project. Links to the datasets, each containing two months of log data, are provided below.

Please read this entire document and plan ahead before starting the project. In particular, Task 3 will have you repeat all previous tasks and will also require a lot of compute time to process the datasets. Start early and ask questions!

Information about Dataset

The data provided to you has the following fields.

NOTE: These fields are explained in the AMD Processor Manuals, which give detailed descriptions of the logged errors and of the architecture/system components they relate to. Numbers with ticks in the dataset are in binary.

NodeID - NOT USED IN THIS STUDY

Date Time - Date in format yyyy-mm-dd HH:MM:SS

Complete Node - Complete node id in the form of cX-YcZsKnT where X-Y is the cabinet coordinate, Z is the chassis in the cabinet, K the slot and T the node number

LDT Link - For errors associated with a HyperTransport link, this field indicates which link was associated with the error

Scrub - ECC error detected by the scrubber

Link - Link in error

Cache way in error - Indicates the cache way in error

Syndrome - Syndrome of the corrected ECC/Chipkill error

Core - ID of core in which error has occurred

Errorcode - The MCi_STATUS error information

Ext_errorcode - Logs an extended error code when an error is detected. Used in conjunction with ErrorCode

Error Type - Type of error

Addr - Address generating the machine check

Addr Desc - Type of address

Errorcode Type - Type of Machine Check Exception

Misc - Miscellaneous data (not used)
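The Complete Node field described above follows a fixed cX-YcZsKnT layout, so it can be split into its components with a regular expression. The following is a minimal sketch, assuming every component is a plain decimal integer; the names `NODE_ID_RE` and `parse_node_id` are illustrative and not part of the provided material.

```python
import re

# Pattern for the cX-YcZsKnT layout; assumes each component is a decimal integer.
NODE_ID_RE = re.compile(r"^c(\d+)-(\d+)c(\d+)s(\d+)n(\d+)$")

def parse_node_id(node_id):
    """Split a complete node id into its components, or return None if it does not match."""
    match = NODE_ID_RE.match(node_id.strip())
    if match is None:
        return None
    cab_x, cab_y, chassis, slot, node = (int(group) for group in match.groups())
    return {"cabinet": (cab_x, cab_y), "chassis": chassis, "slot": slot, "node": node}
```

Entries for which `parse_node_id` returns None are candidates for closer inspection when cleaning the data.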

Summary of Nodes in the System

Compute Nodes

22640 XE and 784 Service nodes

2x AMD OPTERON Processor

8x 8GB DIMMS DDR3

CHIPKILL 8x/4x

GPU Nodes

Before Aug 2013: 3072 XK nodes

1x AMD OPTERON Processor

4x 8GB DIMMS DDR3

1 NVIDIA K20X 6GB GDDR5

CHIPKILL 8x/4x

After Aug 2013: 4224 XK nodes

1x AMD OPTERON Processor

4x 8GB DIMMS DDR3

1 NVIDIA K20X 6GB GDDR5

CHIPKILL 8x/4x

Project Tasks

Task 0 - Get familiar with the analysis environment

Import your assigned dataset and filter out bad entries.

Hint: Take a look at the min/max timestamps. What time range does your data cover?

Hint: You may need to filter data based on additional columns.
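One way to act on the timestamp hint is to parse the Date Time field and keep only entries inside the expected two-month range. The sketch below assumes each entry is a dictionary keyed by the field names listed earlier and that the range boundaries are known for your dataset; `parse_timestamp` and `filter_by_time` are hypothetical helper names.

```python
from datetime import datetime

# Matches the yyyy-mm-dd HH:MM:SS format given for the Date Time field.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_timestamp(text):
    """Return a datetime for a well-formed timestamp string, or None otherwise."""
    try:
        return datetime.strptime(text.strip(), TIMESTAMP_FORMAT)
    except ValueError:
        return None

def filter_by_time(rows, start, end):
    """Keep rows whose timestamp parses and falls in the half-open range [start, end)."""
    kept = []
    for row in rows:
        timestamp = parse_timestamp(row["Date Time"])
        if timestamp is not None and start <= timestamp < end:
            kept.append(row)
    return kept
```

Comparing the number of rows before and after filtering gives a quick count of how many entries were discarded.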

Summarize the following information:

Total number of entries

Unique number of nodes

Number of days

Unique node types

Total number of uncorrectable errors

Different types of machine check exceptions

How would you define error and failure in this data?

Hint: Refer to lecture 4 for an introduction to reliability engineering.

Count the number of MCEs per node. Provide a box plot to summarize your results.
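A box plot draws the minimum, quartiles, and maximum of the per-node counts, so it can help to compute those numbers directly as a check on the plot. This is a minimal sketch using only the standard library; the function names are illustrative, and the "inclusive" quartile method is one of several conventions, so a plotting library may report slightly different quartiles.

```python
from collections import Counter
from statistics import quantiles

def mce_counts_per_node(node_ids):
    """Return a mapping from node id to the number of MCE entries logged for it."""
    return Counter(node_ids)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]
```

Passing `mce_counts_per_node(...).values()` to `five_number_summary` yields the statistics for the per-node counts.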

Compute the mean time between MCEs for:

All nodes together (the whole dataset)

Each of the node types (e.g., XE, XK)
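The mean time between MCEs can be estimated as the average gap between consecutive events once they are sorted by time. The sketch below takes that definition as an assumption (dividing the observation period by the event count is another reasonable estimate) and reports the result in hours; `mean_time_between_events` is a hypothetical name.

```python
from datetime import datetime

def mean_time_between_events(timestamps):
    """Mean gap in hours between consecutive events, or None for fewer than two events."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    # Gaps between each event and the next one, in seconds.
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) / 3600.0
```

Applying the same function to the timestamps of one node type at a time gives the per-type values.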

Task 1 - Analysis of Machine Check Exceptions Rates

Plot the time to MCE distribution. Does this fit any known distribution (e.g., Gaussian, Weibull, Exponential)?
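Alongside the plot, a numeric measure of fit can help compare candidate distributions. The sketch below covers only the exponential case, as an assumption about where to start: it estimates the rate by maximum likelihood (one over the sample mean) and returns the Kolmogorov-Smirnov distance between the empirical and fitted CDFs. The function name is illustrative, and the distance is meant as a descriptive measure rather than a formal test.

```python
import math

def exponential_ks_distance(gaps):
    """Fit an exponential to positive gaps and return (rate, KS distance)."""
    data = sorted(gaps)
    n = len(data)
    rate = n / sum(data)  # maximum-likelihood rate is 1 / sample mean
    distance = 0.0
    for i, x in enumerate(data):
        model_cdf = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, abs(model_cdf - i / n), abs((i + 1) / n - model_cdf))
    return rate, distance
```

A smaller distance means the fitted exponential tracks the data more closely; the same comparison can be repeated for other candidate distributions.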

What percentage of MCEs is due to memory errors?

Hint: Which bank generates memory errors? Take a look at the AMD developers manual.

Provide a breakdown of the number, type (e.g., ECC, L1, L2, memory), and percentage of machine check exceptions for the entire dataset and per node type.
Construct a bar chart to visualize your results.

What are correctable, uncorrectable, and deferred errors in this dataset?

Are there any uncorrectable errors?
If yes, provide a histogram of the time between failures (TBF) for uncorrectable errors.
Compute a separate MTBF and FIT for uncorrectable errors.

FIT is defined as the number of failures in 10^9 hours of operation.

Note: Your time range should be based on the entire dataset, not the filtered dataset with only uncorrectable errors. Make sure you are consistent with your time units.
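The FIT definition above amounts to scaling the observed failure rate to one billion hours. The following is a minimal sketch; the `num_units` parameter is an assumption added for the case where the rate should be expressed per node or per device rather than for the whole system, and it defaults to 1.

```python
def fit_rate(num_failures, observation_hours, num_units=1):
    """Failures per 10^9 hours of operation over the full observation period."""
    if observation_hours <= 0 or num_units <= 0:
        raise ValueError("observation_hours and num_units must be positive")
    # Total operating time is the observation period times the number of units.
    return num_failures / (observation_hours * num_units) * 1e9
```

Here `observation_hours` should span the entire dataset, not only the interval between the first and last uncorrectable error.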

Use a table to summarize the data, for all node types (ALL, XE, XK, service)

Use the x8 syndrome table in the AMD processor manual (section 2.13.2.5) to understand how to solve this problem

How frequent (in time) are multi-bit (>1 bit) errors?

Provide one or two charts of your choice to motivate your answer.

Do different types of nodes (XE, XK, service) behave differently in terms of the frequency of multiple bit errors?

Test the following hypothesis: XK nodes perform worse (have a higher rate of memory errors) than XE nodes.
Remember to normalize rates based on memory capacities of these node types.
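One way to normalize is to express each rate as errors per GB-hour. The sketch below takes the per-node host memory from the node summary (8 x 8 GB for XE, 4 x 8 GB for XK) and assumes that only the DDR3 DIMMs count toward capacity, not the GPU memory; the function name and its inputs are illustrative. It covers the normalization only, not the hypothesis test itself.

```python
# Host DDR3 capacity per node in GB, from the node summary (GPU memory excluded by assumption).
GB_PER_NODE = {"XE": 8 * 8, "XK": 4 * 8}

def errors_per_gb_hour(error_count, node_type, node_count, observation_hours):
    """Memory error rate normalized by installed capacity and observation time."""
    total_gb = GB_PER_NODE[node_type] * node_count
    return error_count / (total_gb * observation_hours)
```

Remember that the XK node count differs before and after Aug 2013, so `node_count` has to match the period your dataset covers.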

How many uncorrectable errors would Blue Waters have if it only used SEC-DED ECC (single-bit error correction, double-bit error detection)?
Blue Waters uses an improved version of ECC which can correct multi-bit errors (as seen in your dataset).
How effective is this improved ECC over regular ECC?

Compare the FIT and MTBF (only for uncorrectable errors) considering the same system with regular ECC and improved ECC.

Summarize your answer in 2-3 sentences.

Task 3 - Data Coalescing

Coalesce your dataset using the Sliding Window algorithm. Justify your window size by providing a knee curve. The Sliding Window algorithm is provided below.
Note: You should only coalesce entries that originate from the same node. Your knee curve will be based on the total number of tuples produced by coalescing, summed over all nodes.
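To show how the per-node restriction and the knee curve fit together, here is a sketch of one common formulation of sliding-window coalescing. It is an assumption, not the algorithm specified in this handout, which takes precedence. In this variant an entry joins the current tuple of its node when it arrives within the window of the previous entry in that tuple. Events are given as hypothetical (node, time in seconds) pairs.

```python
from collections import defaultdict

def coalesce(events, window_seconds):
    """Group (node, time_in_seconds) events into per-node tuples."""
    times_by_node = defaultdict(list)
    for node, time in events:
        times_by_node[node].append(time)
    tuples = {}
    for node, times in times_by_node.items():
        times.sort()
        groups = [[times[0]]]
        for time in times[1:]:
            # Extend the current tuple if the gap to its last entry fits the window.
            if time - groups[-1][-1] <= window_seconds:
                groups[-1].append(time)
            else:
                groups.append([time])
        tuples[node] = groups
    return tuples

def tuple_count(events, window_seconds):
    """Total number of tuples across all nodes for one window size."""
    return sum(len(groups) for groups in coalesce(events, window_seconds).values())
```

Evaluating `tuple_count` over a range of window sizes and plotting the totals against the window size produces the knee curve.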