Ideas for progress: Dynamic RAID

A few ideas on where we could go with Dynamic RAID

Data

We approach the calculation of the probability of data loss in two ways:

  1. on a drive basis
  2. on a pod basis

Drive basis

Things that we consider on a drive basis:

  1. individual probability of failure, discounting correlated failures (i.e. the bathtub-curve model for the drive's model, described earlier)
  2. individual read/write per day rate for the past 7 days
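
The second feature can be derived from the cumulative SMART read/write counters. A minimal sketch, with a hypothetical once-per-day sampled "total read" counter:

```python
import numpy as np

# Hypothetical cumulative "total read" SMART counter, sampled once per day
# for the past 8 days (8 samples give 7 daily deltas).
total_read = np.array([100.0, 112.0, 130.0, 131.0, 150.0, 171.0, 180.0, 200.0])

# Differentiate the running total into per-day deltas, then average
# to get the drive's read rate over the past 7 days.
daily_reads = np.diff(total_read)
read_rate_7d = daily_reads.mean()
```

The same differencing applies to the write counter.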

Pod basis

Reasons why we want to look at most of the decision on a pod basis:

  • Pods consist of 20 drives that are guaranteed to be part of the same RAID and hence the same data pool. Many of the features are significant collectively rather than individually.
  • The concept of data loss in a RAID is not determined on the scale of individual drives, but on the scale of a pod.
  • A lot of the reasons for correlated failures are captured in a pod snapshot but not in an individual drive snapshot.
  • While it is obvious that drives in a pod have reasons to exhibit correlated failures, correlations across larger groups such as clusters or vaults are much less obvious. Pods are therefore probably the right granularity.

SMART attributes to consider

For reference, see the USENIX FAST '20 paper.

All of the attributes above are used on a pod basis (mean or sum); if a value represents a running total, it is differentiated first and then the mean or sum is taken.

On top of this, the total-read and total-write attributes are differentiated and the pod-wide maximum is used.
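
A minimal sketch of this pod-level aggregation, with two hypothetical attributes standing in for the full set (the attribute names and values are illustrative only):

```python
import numpy as np

# Hypothetical snapshot: one value per drive in a 20-drive pod, for a
# gauge attribute (temperature) and a differentiated running total
# (daily read volume derived from "total read").
rng = np.random.default_rng(0)
temperature = rng.normal(35.0, 2.0, size=20)     # gauge: aggregate by mean
daily_reads = rng.uniform(50.0, 150.0, size=20)  # differentiated total

pod_features = {
    "temperature_mean": temperature.mean(),  # mean across the pod
    "daily_reads_sum": daily_reads.sum(),    # sum across the pod
    "daily_reads_max": daily_reads.max(),    # pod-wide maximum, per the text
}
```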

Other important attributes to consider

On a pod basis, we consider:

  1. the number of failures experienced in the past 7 days
  2. days since the latest failure in the pod
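
Both features fall straight out of a pod's failure log. A minimal sketch, with hypothetical failure dates:

```python
from datetime import date

# Hypothetical failure log for a pod: dates on which a drive in the pod failed.
failure_dates = [date(2024, 3, 1), date(2024, 3, 9), date(2024, 3, 12)]
today = date(2024, 3, 14)

# 1. number of failures experienced in the past 7 days
failures_7d = sum(1 for d in failure_dates if (today - d).days <= 7)

# 2. days since the latest failure in the pod
days_since_last = (today - max(failure_dates)).days
```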

Modeling

Per-drive inputs to the pod model

We have one model that evaluates a failure score for each drive based on where it sits on the bathtub curve for its drive model. The SMART attributes, by contrast, are aggregated and passed not to this model but to the pod model.
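
A toy sketch of what the per-drive bathtub-curve score could look like; the parameterisation (infant-mortality window, wear-out knee, base rate) and constants are assumptions, not the actual fitted model:

```python
def bathtub_hazard(age_days, model_params):
    """Toy per-drive failure score following a bathtub-curve shape.

    model_params is a hypothetical per-drive-model tuple
    (infant_days, wearout_days, base_rate): the score is high early
    ("infant mortality"), flat in mid-life, and rises again near wear-out.
    """
    infant_days, wearout_days, base_rate = model_params
    if age_days < infant_days:
        # decaying infant-mortality component on top of the base rate
        return base_rate * (1.0 + 4.0 * (1.0 - age_days / infant_days))
    if age_days > wearout_days:
        # wear-out component grows with age past the knee
        return base_rate * (1.0 + (age_days - wearout_days) / wearout_days)
    return base_rate

params = (90, 1500, 0.001)  # illustrative values for one drive model
```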

Pod model

A classification model:

$$P(\text{loss} \mid \text{state})$$

gives the probability that we experience data loss given the earlier state of the pod.

We can approach this modelling in a few ways:

Way 1:

A simple linear model with a logit link function (i.e. logistic regression) that produces a probability between 0 and 1 of data loss from the parameters discussed above
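
A minimal numpy sketch of this, with made-up weights over three of the pod features above (the weights would of course be fitted to historical pod snapshots, not hand-chosen):

```python
import numpy as np

def sigmoid(z):
    # logit link: maps a linear score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights for three pod features:
# [failures in past 7 days, days since last failure, max daily reads]
weights = np.array([0.8, -0.05, 0.002])
bias = -4.0

def p_loss(pod_state):
    """Logit-link linear model: P(loss | state)."""
    return sigmoid(pod_state @ weights + bias)

risky = np.array([3.0, 1.0, 900.0])    # recent failures, heavy load
quiet = np.array([0.0, 400.0, 100.0])  # no failures in over a year, light load
```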

Way 2:

A classification decision tree (traditional machine learning) using the parameters above
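
To make the tree idea concrete, here is a hand-written stand-in with illustrative splits; a real tree would learn both the thresholds and the leaf probabilities from historical pod snapshots (e.g. via a standard ML library):

```python
def tree_p_loss(failures_7d, days_since_last, max_daily_reads):
    """Hand-written stand-in for a learned classification tree.

    The thresholds and leaf probabilities below are illustrative
    assumptions, not learned values.
    """
    if failures_7d >= 2:
        # multiple recent failures: risk depends on current load
        return 0.30 if max_daily_reads > 500 else 0.15
    if days_since_last < 30:
        # one recent failure, still elevated
        return 0.05
    return 0.01
```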

Way 3:

An LSTM model (a recurrent neural network; more modern machine learning) that takes in a sequence of earlier pod states and generates a predicted next pod state
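
A bare-bones numpy sketch of the recurrence involved, using a standard LSTM cell with random (untrained) weights; a real model would learn the weights and put a prediction head on the final hidden state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell.

    x: pod state at time t; h, c: hidden and cell state carried between
    steps. W, U, b stack the input, forget, cell and output gates.
    """
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:n])          # input gate
    f = sigmoid(z[n:2 * n])      # forget gate
    g = np.tanh(z[2 * n:3 * n])  # candidate cell update
    o = sigmoid(z[3 * n:4 * n])  # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# Feed a week of hypothetical 5-dimensional pod states through the cell.
rng = np.random.default_rng(0)
d, n = 5, 8                          # feature and hidden sizes (assumed)
W = rng.normal(0, 0.1, (4 * n, d))
U = rng.normal(0, 0.1, (4 * n, n))
b = np.zeros(4 * n)
h = c = np.zeros(n)
for x in rng.normal(0, 1, (7, d)):   # 7 days of pod states
    h, c = lstm_step(x, h, c, W, U, b)
# h now summarises the sequence; a linear head on h would predict the next state.
```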

Way 4:

A transformer model (similar to what powers large language models; complicated, cutting-edge machine learning) to predict the pod's state several days ahead
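
The core mechanism is self-attention over the sequence of past pod states. A stripped-down single-head sketch with identity projections (a real transformer learns Q/K/V projections, adds positional information, and stacks many layers):

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention over pod states.

    X: (days, features). Uses identity Q/K/V projections for brevity;
    a causal mask ensures each day attends only to itself and earlier days.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # row-wise softmax over the unmasked positions
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
states = rng.normal(size=(7, 5))   # a week of hypothetical 5-dim pod states
context = self_attention(states)   # each row mixes in information from earlier days
```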