# Investigating 3D NAND Flash Read Disturb Reliability With Extreme Value Analysis

Cristian Zambelli, Member, IEEE, Luca Crippa, Member, IEEE, Rino Micheloni, Senior Member, IEEE, and Piero Olivo

*Abstract*—The storage systems relying on the 3D NAND Flash technology require an extensive modeling of their reliability in different working corners. This enables the deployment of system level management routines that do not compromise the overall performance and reliability of the system itself. Dedicated parametric statistical models have been developed so far to capture the evolution of the memory reliability, although limiting the description to an average behavior rather than extreme cases that can disrupt the storage functionality. In this work, we validate the application of an extreme statistics tool, namely the Points-Over-Threshold method, to characterize the read disturb reliability of a 3D NAND Flash chip. Such technique proved that the die reliability characterized through extreme events analysis can be predicted using a low number of samples and generally holds good prediction features for distribution tail events.

*Index Terms*—3D-TLC NAND flash, read disturb, reliability, points over threshold.

### I. INTRODUCTION

Modelling the reliability of the 3D NAND Flash memory technology [1], [2] is still an important task to be performed as a support for storage system designers dedicated to the development of Solid State Drive (SSD) or Multi Media Card (MMC) products. Indeed, all the firmware solutions implemented in their controllers (i.e., the computing core of SSDs and MMCs), whose goal is mitigating the inherent bit error rate (BER) exposed in different storage working conditions (e.g., endurance stress, data retention at high temperature, etc.), are founded on memory reliability models [3]. To this extent, dedicated parametric statistical models [4]–[6] have been developed so far to capture the evolution of the memory's error distribution through well-known statistical frameworks (defined as probability distributions) like Gaussian, Binomial, Poisson, Gamma, and so on. However, the large process-induced variability of the error characteristics in 3D NAND Flash devices [7], combined with an intrinsic difficulty in testing all the possible permutations of the memory working corners during lifetime and on a relevant statistical population, could hamper an accurate description of the distribution upper tail. It is worth mentioning that storage system designers spend a significant effort on this part of the error distribution to tailor the strength of the Error Correction Codes (ECCs) [8] and of secondary correction schemes like soft decoding [9], [10], Moving Read References [11], and even RAID [12]–[15]. Therefore, the more precise and accurate the model is, the lower the probability of incurring storage performance slowdowns due to improperly calibrated error correction techniques [16].

Manuscript received August 2, 2021; accepted August 27, 2021. (Corresponding author: Cristian Zambelli.)

Cristian Zambelli and Piero Olivo are with the Dipartimento di Ingegneria, Università degli Studi di Ferrara, 44122 Ferrara, Italy (e-mail: cristian.zambelli@unife.it).

Luca Crippa and Rino Micheloni are with the Flash Signal Processing Labs, Microchip Corporation, 20871 Vimercate, Italy.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TDMR.2021.3108941.

Digital Object Identifier 10.1109/TDMR.2021.3108941

In [17], we proposed for the first time to apply a parametric model commonly exploited in extreme value analysis (EVA) for natural sciences and econometrics, namely the Points Over Threshold (POT) method [18], to the modeling of the upper tail of the 3D NAND Flash error distribution. The work demonstrated the potential of this methodology combined with the Generalized Pareto Distribution (GPD) in the analysis of the read disturb stress after data retention. We picked that memory working condition since it is known to represent a critical use case in data center applications for Big Data analytics performed on cold data [19]. Starting from our previous work, we extended the study in two directions: the first is the characterization and estimation of the read disturb also for the hot data scenario (i.e., read after many updates), mimicked by an endurance stress, showing that it is not as critical for the reliability as in retention conditions; the second concerns the cross-validation of the POT and the extension of its implementation with a three-parameter Weibull distribution. This will demonstrate, with a proper confidence, that POT is a powerful statistical tool to estimate the 3D NAND Flash die reliability.

### II. EXPERIMENTAL SETUP

### A. Devices Under Test

The experimental activity in this work is based on the characterization of an off-the-shelf sub-100 layers 3D NAND Flash memory product implementing the Triple Level Cell (TLC) paradigm (see Fig. 1). Such technology is considered, based on its endurance and retention rating, as a mass storage medium for enterprise SSD applications. The statistical sample under investigation is composed of all the pages in every physical layer of 40 memory blocks distributed on multiple dies and chips to account for process-induced variability [7]. Since we are testing a TLC memory, we consider all the page types in the analysis (i.e., LSB-Least Significant Bit page, CSB-Center Significant Bit page, and MSB-Most Significant Bit page). Each page is sized 16 Kbytes plus the spare bytes exploited for data recovery purposes in case of data corruption. However, state-of-the-art ECC implementations [8] work on a subset of the page, generally referred to as a codeword (CW). In this work, the CW size is 4 Kbytes plus the spare bits, so that every page is constituted by 4 CWs. The statistical sample under test is constituted by 184320 CWs.

Fig. 1. TLC 3D NAND Flash architecture considered in this work [17].

1530-4388 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

See https://www.ieee.org/publications/rights/index.html for more information.

### B. Test Flow for Reliability Assessments

The execution of the test flows required to extract the experimental data to fit with the statistical models presented in the paper is performed by the automated test equipment (ATE) presented in [20]. The system interfaces with the 3D NAND Flash chips at a 400 MT/s data rate and allows, for any applied test, to measure the number of corrupted bits (also known as fail bits count or errors number) in each CW. We remind that a bit is considered corrupted after reading from the 3D NAND Flash under test if its value changes from what has been previously written as a result of a reliability degradation process. Fig. 2 summarizes the test procedure performed. After the definition of the statistical sample under test, we write a random pattern on all the TLC pages (therefore on all the CWs) to rule out any topological dependency of the corrupted bits and perform a readout of the memory content. Then, the devices are submitted to an endurance stress up to the rated endurance of the technology (i.e., the maximum number of sustainable block erases before unrecoverable errors) using a JEDEC-based cycling test [21]. The test consists of 3000 Program/Erase cycles at a 61 °C temperature for 500 hours. After the stress we perform a readout of all the CWs under test and we extract the number of corrupted bits for each CW. A read disturb is then performed post-endurance by employing a 1000 uniform block reads access pattern [22] on all tested blocks, followed by another readout for fail bits count extraction. Immediately after the end of the read disturb post-endurance stress, a data retention stress is performed by placing all the devices under test in idle for 90 days at a 40 °C temperature. A double readout is performed at the end of the test to separate the Temporary Read Errors (TRE) effect typical of 3D NAND Flash architectures [23] from the retention stress results. Finally, an additional read disturb post-retention is performed and the fail bits count per CW is extracted accordingly.



Fig. 2. Depiction of the test flow adopted in this work for the 3D NAND Flash reliability characterization.

### III. 3D NAND FLASH READ DISTURB CHARACTERIZATION


We started the read disturb characterization on our devices by evaluating the Empirical Cumulative Distribution Function (ECDF) of the fail bits count extracted on all the CWs before and after the read disturb stress in the post-endurance and post-retention scenarios, as described in Fig. 2. For a given number of fail bits equal to $t$ in a sample $x$, the ECDF is defined as the proportion of the values in $x$ that are less than or equal to $t$. We prefer to use the term ECDF rather than CDF since the latter can be misleading, as it usually refers to a theoretical probability distribution used to fit the experimental data, which is not our case here. Unfortunately, due to confidentiality reasons on the tested 3D NAND Flash samples we cannot disclose the ECDF of the raw fail bits count, so we normalize the number of corrupted bits in a CW to a defined entity. In this work, we normalized the fail bits count with respect to the ECC capacity offered by an advanced correction engine that also incorporates secondary error correction schemes (e.g., read retry and soft-decoding) [9], [10], [24].
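The ECDF definition and the normalization step can be illustrated with a minimal sketch. The function names `normalize_to_ecc` and `ecdf`, the fail bits counts, and the ECC capacity of 100 bits are assumptions made only for illustration; they are not data from the tested devices.

```python
from bisect import bisect_right

def normalize_to_ecc(fail_bits, ecc_capacity):
    """Express each codeword's fail bits count as a fraction of the ECC capacity."""
    return [fb / ecc_capacity for fb in fail_bits]

def ecdf(sample, t):
    """Proportion of the values in `sample` that are <= t."""
    ordered = sorted(sample)
    return bisect_right(ordered, t) / len(ordered)

# Hypothetical fail bits counts for eight codewords and a hypothetical ECC capacity.
fail_bits = [12, 30, 45, 45, 60, 80, 95, 110]
ratios = normalize_to_ecc(fail_bits, ecc_capacity=100)

print(ecdf(ratios, 0.45))  # fraction of codewords with ratio <= 0.45
print(ecdf(ratios, 1.0))   # fraction of codewords with ratio <= 1.0
```

Sorting inside `ecdf` keeps the sketch short; for repeated queries on a large sample, sorting once and reusing the ordered list would be the more natural choice.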

Fig. 3 shows that after endurance stress the read disturb generally increases the number of errors, as expected from other studies in this context [25], [26]. This behavior is related to an over-programming of the memory cells that are not involved in the read operation, due to the moderate voltage applied to unselect them [27]. Looking at the median of the ECDFs per page type, it is observed that the CWs belonging to MSB pages are those displaying the largest error increase, even if there are some CWs in LSB pages (those in the associated ECDF tail) that are largely affected. However, the read disturb post-endurance is not particularly detrimental for the reliability since the ratio fail bits count/ECC capacity is well below one and displays a sufficient margin for safe operation without data corruption.

Fig. 4 shows the results of the same analysis replicated for the retention domain. In the error analysis we also reported the ECDFs retrieved on the first readout of the memory blocks under test to evaluate the impact of the TRE, as discussed in Section II of this work. Interestingly, we observe that the read disturb applied post-retention can recover part of the errors in a CW for all TLC page types. This has been explained in [26] by a charge redistribution mechanism occurring during the read operation. From the reliability standpoint, we note that the retention scenario is the most critical to address since the ratio fail bits count/ECC capacity can be greater than one (especially for MSB pages), thus hampering the data recovery operations.

Fig. 3. ECDFs per TLC page type of the fail bits count normalized with respect to the ECC capacity in post-endurance stress and after the application of the read disturb. The ECDFs are extracted from all tested 3D NAND Flash CWs.

Fig. 4. ECDFs per TLC page type of the fail bits count normalized with respect to the ECC capacity in post-retention stress and after the application of the read disturb. The effect of the TRE is also evidenced in the figure.

Besides the characterization of the error distribution according to the TLC page type, we analyzed the contribution of the topological position (i.e., the layer position in a 3D NAND Flash block) to the fail bits count after the application of the read disturb. Fig. 5 shows the results of this investigation for the read disturb post-retention test case. We focus on this stress condition since it is the one triggering the highest number of errors during tests. We can note a large error variability across the layers in post-retention read disturb and, on top of this, a large error variation between LSB pages and CSB/MSB pages. This is a critical aspect that should be tackled by a statistical model developed to capture error characteristics.

Fig. 5. Fail bits count normalized with respect to the ECC capacity characteristics per TLC page type as a function of layer position in a 3D NAND Flash block after read disturb post-retention. The ECC limit is highlighted for clarity with the black dashed line.

Finally, to better understand the role of the read disturb on the error count and therefore on the memory reliability, we have calculated the error amplification (EA) factor as:

$$EA(i) = \frac{ERD(i)}{EPRE(i)} \tag{1}$$

where *i* is the layer position in the block from 0 to N - 1, with N the number of layers exploited in the manufacturing of the 3D NAND Flash chip, *ERD* is the number of errors post-read disturb, and *EPRE* is the number of errors pre-read disturb. The EA is calculated both for endurance and retention test cases. Even if the endurance case is the one showing the largest EA, we must report once again that such a scenario is not critical for the memory reliability since the error count always remains well below the ECC limits. On the contrary, even if the EA is below unity for post-retention read disturb, there are some critical conditions in which the error count is higher than the ECC capacity and that are worth modeling for future reliability considerations.
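As a minimal sketch of eq. (1), the EA factor can be computed layer by layer as follows. The function name `error_amplification` and the per-layer error counts are hypothetical and serve only to show the arithmetic; returning `None` for layers with no errors before the read disturb is a design choice of this sketch, not something stated in the text.

```python
def error_amplification(errors_post_rd, errors_pre_rd):
    """EA(i) = ERD(i) / EPRE(i) for each layer i; None where EPRE(i) is zero."""
    if len(errors_post_rd) != len(errors_pre_rd):
        raise ValueError("both lists must cover the same layers")
    return [
        post / pre if pre > 0 else None
        for post, pre in zip(errors_post_rd, errors_pre_rd)
    ]

# Hypothetical per-layer error counts for a 4-layer block (illustration only).
epre = [10, 20, 40, 50]   # errors before the read disturb
erd = [15, 30, 30, 50]    # errors after the read disturb

ea = error_amplification(erd, epre)
print(ea)  # EA > 1: errors amplified; EA < 1: errors partially recovered
```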

Fig. 6. EA factor for read disturb calculated as a function of the layer position in the memory block in endurance (a) and retention (b) test cases.

### IV. STANDARD STATISTICAL MODELING APPROACH

The common procedure adopted for modeling the fail bits count, and in general for all the parametric statistical frameworks applied to 3D NAND Flash data, is to fit the entire ECDF retrieved in a specific working condition. For our study case this is after a read disturb stress performed after endurance or after data retention at high temperature. The advantage of such a parametric approach is a rapid estimation of the ECC capacity needed to cover errors, and it has proven useful in many cases. The statistical modeling of the CW fail bits count distribution is based on the assumption that the corrupted bits in a CW are treated as independent events. By considering a CW length of *n* bytes, it is possible to calculate the probability of having *k* errors in the CW exhibiting a BER *p* using the binomial distribution probability mass function as in [4]:

$$y = P_{error}(k|n, p) = \binom{n}{k} p^k (1-p)^{n-k} \tag{2}$$
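A minimal sketch of eq. (2) is given below. It assumes that *n* counts the individual bits of a codeword that can fail independently, and it uses deliberately small hypothetical values of *n* and *p*; the helper `prob_more_than` (probability that the errors exceed a correction capability *t*) is an illustrative addition and not part of the original model description.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k corrupted bits among n independent bits with BER p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def prob_more_than(t, n, p):
    """Probability that more than t bits are corrupted."""
    return 1.0 - sum(binomial_pmf(k, n, p) for k in range(t + 1))

# Small hypothetical values chosen so that the arithmetic stays easy to follow.
n, p = 1000, 1e-3
print(binomial_pmf(0, n, p))     # no corrupted bit at all
print(prob_more_than(3, n, p))   # more than 3 corrupted bits
```

For realistic codeword sizes the binomial coefficient becomes a very large integer, so an implementation working with logarithms would likely be preferable to this direct form.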

However, considering that in 3D NAND Flash technology *n* (equal to 4 Kbytes plus spare bits in our work) is relatively larger than *k* and *p* is usually lower than $2 \times 10^{-2}$, the binomial approach starts to fail. Some alternative distributions like the beta-binomial [28], the Gamma [29], the Gamma-Poisson compound [6], or the Weibull [30] have been considered in the literature due to their capability to account for the intrinsic variability of the memory technology. The easiest to calculate with software tools for numerical analysis are the Gamma and the Weibull distributions. The former is based on the following probability density function:

$$y = P_{error}(\lambda | \alpha, \beta) = \frac{\lambda^{\alpha - 1}}{\Gamma(\alpha)\beta^{\alpha}} e^{-\left(\frac{\lambda}{\beta}\right)} \tag{3}$$

where $\lambda$ is calculated as $n \cdot p$, $\alpha$ is the shape factor, and $\beta$ is the scale factor of the distribution. The latter is:

$$y = P_{error}(k|\alpha,\beta) = \frac{\beta}{\alpha} \left(\frac{k}{\alpha}\right)^{\beta-1} e^{-\left(\frac{k}{\alpha}\right)^{\beta}} \tag{4}$$

with $\alpha$ and $\beta$ being the scale and the shape factor of the distribution, respectively.
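The two densities of eqs. (3) and (4) can be written down directly, as in the sketch below. The function names are illustrative, and the parameters are passed in the same positions in which they appear in the equations; fitting the parameters to measured data is outside the scope of this sketch.

```python
from math import exp, gamma

def gamma_pdf(lam, alpha, beta):
    """Eq. (3): Gamma density evaluated at lam > 0."""
    return lam ** (alpha - 1) * exp(-lam / beta) / (gamma(alpha) * beta**alpha)

def weibull_pdf(k, alpha, beta):
    """Eq. (4) as written: (beta/alpha) * (k/alpha)**(beta-1) * exp(-(k/alpha)**beta)."""
    z = k / alpha
    return (beta / alpha) * z ** (beta - 1) * exp(-(z**beta))

# Sanity check: both reduce to an exponential density in a special case.
print(gamma_pdf(2.0, alpha=1.0, beta=2.0))    # exp(-1) / 2
print(weibull_pdf(2.0, alpha=2.0, beta=1.0))  # exp(-1) / 2
```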

Unfortunately, due to the extreme variability of the fail bits retrieved in different locations of a 3D NAND Flash chip (see Fig. 5), both statistical models (Gamma and Weibull) do not pass the $\chi^2$ goodness-of-fit test (p-value = 0). One may still argue that even if the models do not pass the test, they are still valid for predictions of the ECDF distribution tails, whose practical applications are the selection of the ECC capacity to cover errors or the evaluation of the reliability margin. To this extent, we ran a cross-validation test using a Holdout methodology where 70% of the CWs tested in the experiments are used for training the statistical models and the remaining 30% are used for testing their prediction accuracy. On a total of 1000 cross-validation splits, none passed the goodness-of-fit test. Analyzing the histogram (see Fig. 7) of the fail bits count/ECC capacity and the resulting fits of the statistical models, we observe an underestimation/overestimation of the empirical data distribution, possibly hampering the selection of the correct ECC capacity or its margin. The Gamma distribution performs better than the Weibull, but still does not pass any statistical test. The same considerations can be drawn by looking at the Cumulative Distribution Function (CDF) modeled by the two statistical models. Since in 3D NAND Flash reliability modeling we are mainly interested in the low probability tail of the error distribution (i.e., extreme events), we need a framework capable of handling only that part of the empirical data.
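The Holdout procedure and the statistic behind the $\chi^2$ test can be sketched as follows. The function names, the seed handling, and the bin counts are assumptions for illustration only; the sketch stops at the Pearson statistic, since turning it into a p-value requires the $\chi^2$ distribution function from a statistics library.

```python
import random

def holdout_split(sample, train_fraction=0.7, seed=None):
    """Randomly split `sample` into a training part and a held-out part."""
    rng = random.Random(seed)
    shuffled = list(sample)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def chi_square_statistic(observed, expected):
    """Pearson statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

data = list(range(100))                      # stand-in for the measured CWs
train, held_out = holdout_split(data, seed=0)
print(len(train), len(held_out))

# Hypothetical histogram counts: measured bins versus bins predicted by a fitted model.
print(chi_square_statistic([18, 22, 30, 30], [25, 25, 25, 25]))
```

Repeating `holdout_split` with different seeds gives the multiple cross-validation splits mentioned in the text.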

### V. THE POT METHOD FOR READ DISTURB EVA

### A. Introducing the POT-GPD

Motivated by this, we explored a statistical framework related to EVA [18] that is commonly used in tail data analysis, namely the POT. By treating the 3D NAND Flash CWs as a sequence of *i.i.d.* measurements $x_1, x_2, \ldots, x_n$, we can define as extreme events all the CWs that exceed a defined error threshold $u$, for which we can define an exceedance as:

$$\{x_i : x_i \ge u\}. \tag{5}$$

If the exceedances are labeled as $x_{(1)}, \ldots, x_{(k)}$, it is possible to define a threshold excess as:

$$y_j = x_{(j)} - u, \quad j = 1, \ldots, k. \tag{6}$$
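Eqs. (5) and (6) translate into a few lines of code. The function name `threshold_excesses` and the normalized fail bits counts below are hypothetical; sorting the exceedances is a convenience of this sketch.

```python
def threshold_excesses(values, u):
    """Return (exceedances, excesses): the values with x >= u, and x - u for each of them."""
    exceedances = sorted(x for x in values if x >= u)
    excesses = [x - u for x in exceedances]
    return exceedances, excesses

ratios = [0.40, 0.85, 1.00, 1.25, 0.95, 1.50]  # hypothetical normalized counts
exceedances, excesses = threshold_excesses(ratios, u=1.0)
print(exceedances)  # [1.0, 1.25, 1.5]
print(excesses)     # [0.0, 0.25, 0.5]
```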

From the probability theory it is proven that a random variable $Y_i$ based on the threshold excesses follows a GPD [18]. For a large enough threshold $u$, we can write its probability density function as:

$$f(y) = \sigma^{-1} \left( 1 + \frac{\xi y}{\sigma} \right)^{-1 - \xi^{-1}} \tag{7}$$

with the parameters $\xi$ and $\sigma$ being the distribution shape and scale factors, respectively.
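A minimal sketch of the GPD density of eq. (7), together with the corresponding cumulative distribution, is shown below. The function names are illustrative. The handling of the limiting case $\xi = 0$ (exponential distribution) and of the finite upper end point that exists when $\xi < 0$ follows the standard GPD definition and is an addition of this sketch.

```python
from math import exp

def gpd_pdf(y, xi, sigma):
    """Eq. (7): GPD density of a threshold excess y >= 0 (0 outside the support)."""
    if y < 0:
        return 0.0
    if xi == 0.0:                      # limiting case: exponential distribution
        return exp(-y / sigma) / sigma
    base = 1.0 + xi * y / sigma
    if base <= 0.0:                    # beyond the upper end point when xi < 0
        return 0.0
    return base ** (-1.0 - 1.0 / xi) / sigma

def gpd_cdf(y, xi, sigma):
    """Probability that a threshold excess is <= y."""
    if y <= 0:
        return 0.0
    if xi == 0.0:
        return 1.0 - exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0.0:
        return 1.0
    return 1.0 - base ** (-1.0 / xi)

print(gpd_pdf(2.0, xi=0.5, sigma=1.0))  # 2**-3 = 0.125
print(gpd_cdf(2.0, xi=0.5, sigma=1.0))  # 1 - 2**-2 = 0.75
```

With a negative shape parameter the excesses are bounded by $-\sigma/\xi$, which is why both functions saturate beyond that point.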

### B. Threshold Choice

The most critical operation in POT-GPD statistical modeling is the selection of the best threshold to apply. The selection of an optimal threshold within a region of interest (ROI) requires a bias-variance trade-off and a knowledge of the ECC capabilities offered in the 3D NAND Flash data recovery processes. If the chosen model threshold is too low, the results are biased because the asymptotic assumption of the model is invalid. In other words, a too low threshold will result in exceedances not converging to the GPD, since the probability distribution is based on its capability of fitting extreme events [31]. On the other hand, if the threshold is too high, the variance is large due to the few exceedances. In [32], it is stated that the threshold must be high enough for the exceedances over threshold to converge to the GPD, while the sample size should be large enough to ensure that there are enough data points left for a satisfactory determination of the GPD parameters. Additionally, in [33] it is evidenced that the standard practice when choosing a threshold is to select the lowest threshold possible for which the limit model (i.e., the GPD) provides a reasonable approximation for the exceedances.

Fig. 7. Gamma and Weibull distributions exploited to fit the fail bits count/ECC capacity distribution on all the CWs measured after read disturb post-retention stress.

A method to identify the correct threshold lies in the use of the mean residual life (MRL) plot combined with the stability plots of the GPD parameters. Concerning the former, the locus of points defined as:

$$\left\{ \left( u, \frac{1}{n_u} \sum_{i=1}^{n_u} (x_{(i)} - u) \right) : u < x_{max} \right\} \tag{8}$$

where $x_{(1)}, \ldots, x_{(n_u)}$ are the $n_u$ CWs exceeding the threshold $u$ and $x_{max}$ is the largest of the $x_i$, should be approximately linear in a ROI of $u$ to define a proper threshold. For the latter, it is important to check whether the estimated GPD parameters are stable (i.e., constant) in the ROI, but after the following transformation of the GPD scale parameter [18]:

$$\sigma^* = \sigma - \xi u. \tag{9}$$
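The two diagnostics can be sketched numerically as follows: the points of the MRL plot of eq. (8) and the transformed scale of eq. (9). The function names and the sample values are hypothetical; in this sketch a value counts as exceeding $u$ only when it is strictly larger, which matches the condition $u < x_{max}$.

```python
def mean_residual_life(values, thresholds):
    """Eq. (8): for each u, the mean of (x - u) over the values exceeding u."""
    points = []
    for u in thresholds:
        excesses = [x - u for x in values if x > u]
        if excesses:                   # only defined for u < max(values)
            points.append((u, sum(excesses) / len(excesses)))
    return points

def modified_scale(sigma, xi, u):
    """Eq. (9): transformed scale, expected to be constant in u where the GPD holds."""
    return sigma - xi * u

values = [0.5, 1.0, 1.5, 2.0, 3.0]  # hypothetical normalized counts
print(mean_residual_life(values, [0.0, 1.0, 2.0, 3.0]))
print(modified_scale(sigma=0.07, xi=-0.13, u=1.0))
```

Plotting the returned points against $u$ and looking for an approximately linear region is the visual check described in the text; the stability plots additionally require refitting the GPD at each candidate threshold, which is not shown here.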

It is also important to check whether the threshold is meaningful for reliability investigations. To this extent, since all our fail bits count data are normalized with respect to the ECC capacity, we decided to set the threshold u = 1. Every exceedance will therefore represent an unrecoverable CW and hence a reliability concern in storage applications. Please note that in our study case we set the ECC capacity to match that offered by state-of-the-art correction engines. Retrospectively, we evaluated that such a choice also grants a sufficient number of exceedances on which to apply the POT-GPD fit. Fig. 8 shows the validation of the threshold choice procedure for read disturb post-retention data. Since this scenario represents a critical case for the reliability, surely more than the read disturb post-endurance as evidenced in the previous sections of the work, we will base all our investigations on this corner.



Fig. 8. (a) Mean residual life plot with a region of interest (ROI) highlighted. (b) and (c) Stability plots of GPD parameters. The CWs data are from post-retention read disturb tests [17].

### C. Extending the EVA With POT-Weibull

The POT approach can be complemented with any probability distribution that can embed the concept of threshold. The GPD has been proven to be one of the best statistical tools for modeling threshold excesses, although it is not the only one. The Weibull distribution can be another viable approach, but not in the form of eq. (4). Indeed, the threshold concept must be included as in the following:

$$f(y) = \frac{\beta}{\alpha} \left(\frac{y - u}{\alpha}\right)^{\beta - 1} e^{-\left(\frac{y - u}{\alpha}\right)^{\beta}} \tag{10}$$

where $\alpha$ and $\beta$ are the same parameters defined in eq. (4), and *u* is the threshold defined in the previous section. This distribution is also referred to as a three-parameter Weibull [34], which we will use as a benchmark for the GPD.



Fig. 9. (a) PDF and (b) CDF of the POT-GPD and POT-Weibull distributions devised in the modeling of the read disturb post-retention CW exceedances over threshold.

TABLE I. POT-GPD AND POT-WEIBULL PARAMETERS ESTIMATE ON READ DISTURB POST-RETENTION DATA USING THRESHOLD u = 1

|                          | POT-GPD | POT-Weibull |
|--------------------------|---------|-------------|
| $\hat{\xi}$              | -0.13   | -           |
| $\hat{\sigma}$           | 0.07    | -           |
| $\hat{\alpha}$           | -       | 0.07        |
| $\hat{\beta}$            | -       | 1.15        |
| p-value ($\chi^2$-test)  | 0.29    | 0.21        |

### VI. ESTIMATING DIE-LEVEL RELIABILITY

### A. Fitting Process and Model Cross-Validation

After the threshold choice process for read disturb post-retention data, we evaluated the capability of the POT-GPD and POT-Weibull models to fit the exceedances of the CW error distribution. In Fig. 9, we show that both the probability density function (PDF) and the CDF obtained through maximum likelihood estimation (MLE) fit the experimental data well for both models. The estimated GPD parameters $\hat{\xi}$ and $\hat{\sigma}$ and the Weibull parameters $\hat{\alpha}$ and $\hat{\beta}$ are reported in Table I. We ran a $\chi^2$ goodness-of-fit test with a 0.05 significance level to prove that the exceedances can be described with both POT approaches. The test passed with a p-value = 0.29 for POT-GPD and with a p-value = 0.21 for POT-Weibull (the higher the p-value the better), so there is no evidence to discard this statistical hypothesis.

### B. Calculating the POT Return Level

We put the POT models at work to predict the die-level reliability of a 3D NAND Flash chip in the read disturb post-retention scenario measured by our experimental setup. This process can be helpful, for example, for system designers that require a fast evaluation of the technological capabilities from the available measurements.

The POT-GPD method enables such reliability assessment through the return level evaluation [31], [33]. The return level and the return period are two important concepts in the POT theory, thus requiring a proper introduction. If we define a return period *N* of a CW, measured in number of 3D NAND Flash memory blocks, the return level *x* is the threshold that is exceeded in one memory block with probability $\frac{1}{N}$. This is equivalent to claiming that the return level *x* is exceeded on average once in *N* blocks. As an example, a CW with a fail bits count/ECC capacity ratio equal to 1.1 has a return period of 3 blocks if and only if the probability of observing a CW whose fail bits count/ECC capacity ratio is higher than 1.1 in a block is $\frac{1}{3}$. In the POT theory, the return period is calculated as:

$$N = \underbrace{\frac{\text{number of CWs exceeding the threshold}}{\text{total number of CWs measured in blocks}} \times m}_{\text{expected number of exceedances in } m \text{ blocks}} \tag{11}$$

From the previous equation, it follows that *N* is the number of events over threshold between the occurrence of two consecutive CWs, both with a return period of *m* blocks. Hence, $\frac{1}{N}$ (i.e., the return level) is the probability of observing a CW with a return period of *m* blocks in one block. If we choose the CDF *F*(*x*) of a specified probability distribution (i.e., the GPD or the Weibull used in the POT method) and *F*(*x*) = $1 - \frac{1}{N}$, then *F*(*x*) is the probability of observing any CW with a fail bits count/ECC capacity ratio less than or equal to *x* in one block.

Starting from that, we assumed that a 3D NAND Flash die is composed of 3000 blocks and then we estimated the return level per block in the case of the GPD as:

$$x_m = u + \frac{\hat{\sigma}}{\hat{\xi}} \left[ \left( m \hat{\zeta}_u \right)^{\hat{\xi}} - 1 \right] \tag{12}$$

where *m* is the number of blocks, $\hat{\zeta}_u$ is the probability to have an exceedance when a threshold *u* is considered, and $\hat{\sigma}$ and $\hat{\xi}$ are the estimated parameters of the GPD. In the case of a Weibull distribution the previous return level equation becomes:

$$x_m = u + \hat{\alpha} \left[ \log \left( m \hat{\zeta}_u \right) \right]^{1/\hat{\beta}} \tag{13}$$

where $\hat{\alpha}$ and $\hat{\beta}$ are the estimated parameters of the three-parameter Weibull distribution.
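The two return level expressions can be evaluated with the short sketch below. The threshold and the parameter estimates are those of Table I, whereas the exceedance probability `zeta_u = 0.05` is a hypothetical value chosen only to make the example run. Eq. (13) is read here as the logarithm raised to the power $1/\hat{\beta}$, which is what follows from inverting the Weibull survival function.

```python
from math import log

def return_level_gpd(m, u, sigma, xi, zeta_u):
    """Eq. (12): level exceeded on average once every m blocks (xi != 0)."""
    return u + (sigma / xi) * ((m * zeta_u) ** xi - 1.0)

def return_level_weibull(m, u, alpha, beta, zeta_u):
    """Eq. (13): same quantity for the three-parameter Weibull (needs m * zeta_u >= 1)."""
    return u + alpha * log(m * zeta_u) ** (1.0 / beta)

# Threshold and parameter estimates from Table I; zeta_u is a hypothetical value.
u, zeta_u = 1.0, 0.05
for m in (100, 1000, 3000):
    print(m,
          return_level_gpd(m, u, sigma=0.07, xi=-0.13, zeta_u=zeta_u),
          return_level_weibull(m, u, alpha=0.07, beta=1.15, zeta_u=zeta_u))
```

Because the estimated GPD shape is negative, the GPD return level saturates below $u - \hat{\sigma}/\hat{\xi}$ as *m* grows, while the Weibull return level keeps increasing.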

The results in Fig. 10 for read disturb post-retention measurements show the return level to be expected for 3000 blocks, also considering the 95% confidence interval for the GPD and Weibull parameter estimates. From the return level analysis, we infer two results: i) the empirical data fall out of the POT-Weibull return level confidence interval for some points; ii) for a high number of blocks, the POT-GPD provides an optimistic estimation of the return level with respect to the POT-Weibull (lower fail bits count/ECC capacity ratio) while providing a larger return level confidence interval. We also ran a cross-validation test using a Holdout methodology where 70% of the CWs tested in the experiments are used for training the statistical models and the remaining 30% are used for testing the POT models. On a total of 1000 cross-validation splits we report that the median p-value of the POT-GPD approach is slightly higher than that of the POT-Weibull, supporting the better prediction capabilities of the former model.



Fig. 10. Return level estimate for read disturb post-endurance of the POT-GPD and the POT-Weibull model with 95% confidence interval.



Fig. 11. Boxplot of the cross-validation splits performed for POT-GPD and POT-Weibull methods.

All these results clearly indicate that in mass storage applications like SSDs and MMCs, where many blocks are considered for high storage capacity, advanced protection must be ensured since it is highly probable to encounter an unfavorable situation (i.e., unrecoverable errors) in some of the blocks constituting the memory die. This requires additional effort at system level to mitigate the 3D NAND Flash error probability.

### C. Bootstrapping the POT Estimates

Since the POT-GPD and the POT-Weibull parameters are obtained through an MLE process, we had to retrieve their confidence interval through a bootstrap analysis of the parameters with 1000 replicas of the exceedances dataset. A single bootstrap replica is a random sample of size $n_u$, defined as $(x_1^*, x_2^*, \dots, x_{n_u}^*)$, drawn with replacement from the exceedances population of $n_u$ samples retrieved with the procedure described in the former section of this work. In this replica, some of the original samples appear zero times, some appear once, and some appear multiple times. Fig. 12 shows a quantile-quantile plot proving a normal distribution of the POT-GPD parameters on which



Fig. 12. Bootstrap simulation on the GPD parameters by resampling 1000 times the exceedances CW in the read disturb post-retention dataset [17].



Fig. 13. Standard deviation in return level estimates depending on the chosen resampling technique. Solid lines are read disturb post-endurance data whereas dashed lines are post-retention.

it is easy to extract the confidence interval. Similar results (not shown) are achieved for the POT-Weibull. Nevertheless, we must report that this procedure has some issues in the lower quantiles of the normal distribution. This is ascribed to the convergence of the MLE process to a boundary point of the parameter space for some bootstrap samples.
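The bootstrap resampling step can be illustrated with a short sketch. It assumes the exceedances are stored in a list and, for brevity, uses the sample mean as a stand-in estimator instead of the MLE of the GPD or Weibull parameters; the function names, the percentile interval, and the toy values are hypothetical and are not taken from the paper.

```python
import random
import statistics

def bootstrap_replicas(exceedances, n_replicas=1000, seed=0):
    """Return n_replicas resamples drawn with replacement, each of size n_u."""
    rng = random.Random(seed)
    n_u = len(exceedances)
    return [rng.choices(exceedances, k=n_u) for _ in range(n_replicas)]

def percentile_interval(estimates, level=0.95):
    """Percentile confidence interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]

# Toy exceedances (hypothetical values, not data from the paper).
toy_exceedances = [0.62, 0.65, 0.66, 0.70, 0.71, 0.75, 0.80, 0.88, 0.93, 1.05]
replicas = bootstrap_replicas(toy_exceedances)
# Stand-in estimator: the sample mean replaces the MLE of the model parameters.
estimates = [statistics.fmean(r) for r in replicas]
ci_low, ci_high = percentile_interval(estimates)
```

In an actual analysis, the stand-in estimator would be replaced by the MLE fit of each replica, and the interval would be extracted for each fitted parameter.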

Finally, we tried another resampling technique to check if we would achieve consistent results in the POT-GPD and POT-Weibull parameters estimation, namely the jack-knife resampling. In this technique, if the original dataset of $n_u$ exceedances is employed, the $i$-th jack-knife sample is defined as:

$$x_{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{n_u}), \quad i = 1, \ldots, n_u. \tag{14}$$

The calculation for this method has been performed with a commercial tool for matrix data manipulation. To compare the prediction accuracy of both resampling techniques, we plotted the standard deviation of the return level predictions as a function of the return level calculated with (12). As we can see in Fig. 13, the jack-knife resampling provides the smallest standard deviation for estimates (there is a difference up to 40 times at die level prediction) performed for read disturb post-retention. This result is attributed to a small variation of the newly generated samples (the replica datasets differ by a single value). To this extent, the jack-knife resampling technique is not well suited to be used together with the POT approach, since the generated replicas are not so different; hence, estimations based on these samples differ slightly and could lead to optimistic predictions.
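The small spread of the jack-knife estimates can be reproduced with a toy example. The sketch below builds the leave-one-out samples of (14) and, as an assumption made only for illustration, uses the sample mean in place of the MLE fit; the function name and the values are hypothetical.

```python
import statistics

def jackknife_samples(exceedances):
    """Return the n_u leave-one-out samples x_(i) of Eq. (14)."""
    return [exceedances[:i] + exceedances[i + 1:] for i in range(len(exceedances))]

# Toy exceedances (hypothetical values, not data from the paper).
toy_exceedances = [0.62, 0.65, 0.66, 0.70, 0.71, 0.75, 0.80, 0.88, 0.93, 1.05]
samples = jackknife_samples(toy_exceedances)
# Stand-in estimator (sample mean) in place of the MLE fit used in the paper.
estimates = [statistics.fmean(s) for s in samples]
spread = statistics.pstdev(estimates)
data_spread = statistics.pstdev(toy_exceedances)
```

For the sample mean, the spread of the leave-one-out estimates equals the spread of the data divided by $n_u - 1$, which shows how little the replicas differ from one another.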

### VII. CONCLUSION

In this work, we validated the POT methodology as a technique for EVA to be applied on a study case like the read disturb reliability modeling in 3D NAND Flash memories. The effectiveness of the model proved its applicability in die level reliability predictions of the number of errors per CW in an important scenario like the post-retention use case. The methodology could be beneficial for storage system level designers dealing with error mitigation schemes. In the future, we plan to apply the methodology to other 3D NAND Flash reliability threats and to model extreme events in SSD platforms studied at architectural level.

### REFERENCES

- [1] R. Micheloni, S. Aritome, and L. Crippa, "Array architectures for 3-D NAND flash memories," Proc. IEEE, vol. 105, no. 9, pp. 1634-1649, Sep. 2017, doi: 10.1109/JPROC.2017.2697000.
- [2] A. S. Spinelli, C. M. Compagnoni, and A. L. Lacaita, "Reliability of NAND flash memories: Planar cells and emerging issues in 3D devices," Computers, vol. 6, no. 2, p. 16, 2017, doi: 10.3390/computers6020016.
- [3] T. A. Marquart, "Solid-state-drive qualification and reliability strategy," in Proc. IEEE Int. Integr. Rel. Workshop (IIRW), South Lake Tahoe, CA, USA, Oct. 2015, pp. 3-6, doi: 10.1109/IIRW.2015.7437056.
- [4] N. Mielke et al., "Bit error rate in NAND flash memories," in Proc. IEEE Int. Rel. Phys. Symp., Phoenix, AZ, USA, Apr. 2008, pp. 9-19, doi: 10.1109/RELPHY.2008.4558857.
- [5] T. Parnell, N. Papandreou, T. Mittelholzer, and H. Pozidis, "Modelling of the threshold voltage distributions of sub-20nm NAND flash memory," in Proc. IEEE Global Commun. Conf., Austin, TX, USA, Dec. 2014, pp. 2351-2356, doi: 10.1109/GLOCOM.2014.7037159.
- [6] N.-J. Wang et al., "Statistical analysis of bit-errors distribution for reliability of 3-D NAND flash memories," in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Dallas, TX, USA, Apr. 2020, pp. 1-5, doi: 10.1109/IRPS45951.2020.9128993.
- [7] C. Zambelli, R. Micheloni, and P. Olivo, "Reliability challenges in 3D NAND flash memories," in Proc. IEEE 11th Int. Memory Workshop (IMW), Monterey, CA, USA, May 2019, pp. 1-4, doi: 10.1109/IMW.2019.8739741.
- [8] L. Zuolo, C. Zambelli, R. Micheloni, and P. Olivo, "Solid-state drives: Memory driven design methodologies for optimal performance," Proc. IEEE, vol. 105, no. 9, pp. 1589-1608, Sep. 2017, doi: 10.1109/JPROC.2017.2733621.
- [9] T. Zhang, Using LDPC Codes in SSD-Challenges and Solutions, Flash Memory Summit, Santa Clara, CA, USA, Aug. 2012.
- [10] E. F. Haratsch, LDPC Code Concepts and Performance on High-Density Flash Memory, Flash Memory Summit, Santa Clara, CA, USA, Aug. 2014.
- [11] N. R. Mielke, R. E. Frickey, I. Kalastirsky, M. Quan, D. Ustinov, and V. J. Vasudevan, "Reliability of solid-state drives based on NAND flash memory," Proc. IEEE, vol. 105, no. 9, pp. 1725-1750, Sep. 2017, doi: 10.1109/JPROC.2017.2725738.
- [12] S. Im and D. Shin, "Flash-aware RAID techniques for dependable and high-performance flash memory SSD," IEEE Trans. Comput., vol. 60, no. 1, pp. 80-92, Jan. 2011, doi: 10.1109/TC.2010.197.
- [13] J. Kim, E. Lee, J. Choi, D. Lee, and S. H. Noh, "Chip-level RAID with flexible stripe size and parity placement for enhanced SSD reliability," IEEE Trans. Comput., vol. 65, no. 4, pp. 1116-1130, Apr. 2016, doi: 10.1109/TC.2014.2375179.
- [14] Y. Li, P. P. C. Lee, and J. C. S. Lui, "Analysis of reliability dynamics of SSD RAID," IEEE Trans. Comput., vol. 65, no. 4, pp. 1131-1144, Apr. 2016, doi: 10.1109/TC.2014.2349505.
- [15] C. Zambelli, A. Marelli, R. Micheloni, and P. Olivo, "Modeling the endurance reliability of intradisk RAID solutions for mid-1X TLC NAND flash solid-state drives," IEEE Trans. Device Mater. Rel., vol. 17, no. 4, pp. 713-721, Dec. 2017, doi: 10.1109/TDMR.2017.2749639.
- [16] A. Grossi, L. Zuolo, F. Restuccia, C. Zambelli, and P. Olivo, "Quality-of-service implications of enhanced program algorithms for charge-trapping NAND in future solid-state drives," IEEE Trans. Device Mater. Rel., vol. 15, no. 3, pp. 363-369, Sep. 2015, doi: 10.1109/TDMR.2015.2448108.
- [17] C. Zambelli, L. Crippa, R. Micheloni, and P. Olivo, "Points-over-threshold statistics for post-retention read disturb reliability in 3D NAND flash," in Proc. IEEE Int. Integr. Rel. Workshop (IIRW), South Lake Tahoe, CA, USA, 2020, pp. 1-5.
- [18] M. R. Leadbetter, "On a basis for 'peaks over threshold' modeling," Stat. Probab. Lett., vol. 12, no. 4, pp. 357-362, 1991, doi: 10.1016/0167-7152(91)90107-3.
- [19] K. Ha, J. Jeong, and J. Kim, "A read-disturb management technique for high-density NAND flash memory," in Proc. 4th Asia-Pac. Workshop Syst., 2013, pp. 1-6, doi: 10.1145/2500727.2500743.
- [20] C. Zambelli et al., "Characterization of TLC 3D-NAND flash endurance through machine learning for LDPC code rate optimization," in Proc. IEEE Int. Memory Workshop (IMW), Monterey, CA, USA, 2017, pp. 1-4, doi: 10.1109/IMW.2017.7939074.
- [21] Electrically Erasable Programmable ROM (EEPROM) Program/Erase Endurance and Data Retention Test, document JESD22-A117, JEDEC, Arlington, VA, USA, Oct. 2018.
- [22] C. Zambelli, P. Olivo, L. Crippa, A. Marelli, and R. Micheloni, "Uniform and concentrated read disturb effects in mid-1X TLC NAND flash memories for enterprise solid state drives," in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Monterey, CA, USA, 2017, pp. 1-4, doi: 10.1109/IRPS.2017.7936387.
- [23] C. Zambelli, R. Micheloni, S. Scommegna, and P. Olivo, "First evidence of temporary read errors in TLC 3D-NAND flash memories exiting from an idle state," IEEE J. Electron Devices Soc., vol. 8, pp. 99-104, 2020, doi: 10.1109/JEDS.2020.2965648.
- [24] Microsemi. (2019). Flashtec NVMe2032 PM8609 NVMe Controller. [Online]. Available: https://www.microsemi.com/product-directory/storage-ics/3687-flashtec-nvme-controllers
- [25] N. Papandreou et al., "Characterization and analysis of bit errors in 3D TLC NAND flash memory," in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Monterey, CA, USA, 2019, pp. 1-6, doi: 10.1109/IRPS.2019.8720454.
- [26] F. Wang et al., "Lateral charge migration induced abnormal read disturb in 3D charge-trapping NAND flash memory," Appl. Phys. Exp., vol. 13, no. 5, Apr. 2020, Art. no. 054002, doi: 10.35848/1882-0786/ab8729.
- [27] Y. Cai, Y. Luo, S. Ghose, and O. Mutlu, "Read disturb errors in MLC NAND flash memory: Characterization, mitigation, and recovery," in Proc. IEEE/IFIP Int. Conf. Depend. Syst. Netw., Rio de Janeiro, Brazil, Jun. 2015, pp. 438-449, doi: 10.1109/DSN.2015.49.
- [28] V. Taranalli, H. Uchikawa, and P. H. Siegel, "On the capacity of the beta-binomial channel model for multi-level cell flash memories," IEEE J. Sel. Areas Commun., vol. 34, no. 9, pp. 2312-2324, Sep. 2016, doi: 10.1109/JSAC.2016.2603660.
- [29] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, "Improving 3D NAND flash memory lifetime by tolerating early retention loss and process variation," 2018. [Online]. Available: arXiv:1807.05140
- [30] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, "Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling," in Proc. Design Autom. Test Eur. Conf. Exhibit. (DATE), Grenoble, France, 2013, pp. 1285-1290, doi: 10.7873/DATE.2013.266.
- [31] C. Stander, "Analysis of extreme events in the coastal engineering environment," M.S. thesis, Dept. Appl. Math., Stellenbosch Univ., Stellenbosch, South Africa, Dec. 2015.
- [32] N. Teena, V. S. Kumar, K. Sudheesh, and R. Sajeev, "Statistical analysis on extreme wave height," Nat. Hazards J. Int. Soc. Prevent. Mitigation Nat. Hazards, vol. 64, no. 1, pp. 223-236, Oct. 2012, doi: 10.1007/s11069-012-0229-v.
- [33] S. Coles, An Introduction to Statistical Modeling of Extreme Values. London, U.K.: Springer, 2001.
- [34] E. Merran, N. Hastings, and B. Peacock, Statistical Distributions, 2nd ed. New York, NY, USA: Wiley, 1993.


# Investigating 3D NAND Flash Read Disturb Reliability With Extreme Value Analysis

Cristian Zambelli, Member, IEEE, Luca Crippa, Member, IEEE, Rino Micheloni, Senior Member, IEEE, and Piero Olivo

*Abstract*—The storage systems relying on the 3D NAND Flash technology require an extensive modeling of their reliability in different working corners. This enables the deployment of system level management routines that do not compromise the overall performance and reliability of the system itself. Dedicated parametric statistical models have been developed so far to capture the evolution of the memory reliability, although limiting the description to an average behavior rather than extreme cases that can disrupt the storage functionality. In this work, we validate the application of an extreme statistics tool, namely the Points-Over-Threshold method, to characterize the read disturb reliability of a 3D NAND Flash chip. Such technique proved that the die reliability characterized through extreme events analysis can be predicted using a low number of samples and generally holds good prediction features for distribution tail events.

Index Terms—3D-TLC NAND flash, read disturb, reliability,
points over threshold.

### I. INTRODUCTION

Modelling the reliability of the 3D NAND Flash memory technology [1], [2] is still an important task to be performed as a support for storage system designers dedicated to Solid State Drives (SSDs) or Multi Media Card (MMC) products development. Indeed, all the firmware solutions implemented in their controllers (i.e., the computing core of SSDs and MMCs) whose goal is mitigating the inherent bit error rate (BER) exposed in different storage working conditions (e.g., endurance stress, data retention at high temperature, etc.) are well founded on memory reliability models [3]. To this extent, dedicated parametric statistical models [4]–[6] have been developed so far to capture the evolution of the memory's errors distribution through well-known statistical frameworks (defined as probability distributions) like Gaussian, Binomial, Poisson, Gamma, and so on. However, the large process-induced variability of the error characteristics in 3D NAND Flash devices [7] combined with an intrinsic difficulty in testing all the possible permutations of the memory working corners during lifetime and on a relevant statistical population,

Manuscript received August 2, 2021; accepted August 27, 2021. (Corresponding author: Cristian Zambelli.)

Cristian Zambelli and Piero Olivo are with the Dipartimento di Ingegneria, Università degli Studi di Ferrara, 44122 Ferrara, Italy (e-mail: cristian.zambelli@unife.it).

Luca Crippa and Rino Micheloni are with the Flash Signal Processing Labs, Microchip Corporation, 20871 Vimercate, Italy.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TDMR.2021.3108941.

Digital Object Identifier 10.1109/TDMR.2021.3108941

could hamper an accurate description of the distribution upper tail. It is worth mentioning that storage system designers spend a significant effort on this part of the errors distribution to tailor the strength of the Error Correction Codes (ECCs) [8] and the secondary correction schemes like soft decoding [9], [10], Moving Read References [11], and even RAID [12]–[15]. Therefore, the more precise and accurate the model, the lower the probability of incurring storage performance slowdowns due to improperly calibrated error correction techniques [16].

In [17], we proposed for the first time to apply a parametric model commonly exploited in extreme value analysis (EVA) for natural sciences and econometrics to 3D NAND Flash errors distribution upper tail modeling, namely the Points-Over-Threshold (POT) method [18]. The work demonstrated the potential of this methodology combined with the Generalized Pareto Distribution (GPD) in the analysis of the read disturb stress after data retention. We picked that memory working condition since it is known to represent a critical use case in data center applications for Big Data analytics performed on cold data [19]. Starting from our previous work, we extended the study in two directions: the first is related to the characterization and estimation of the read disturb also for a hot data scenario (i.e., read after many updates) mimicked by an endurance stress, showing that it is not as critical for the reliability as in retention conditions; the second concerns the cross-validation of the POT and the extension of its implementation with a three-parameter Weibull distribution. This will demonstrate, with a proper confidence, that POT is a powerful statistical tool to estimate the 3D NAND Flash die reliability.

### II. EXPERIMENTAL SETUP

### A. Devices Under Test

The experimental activity in this work is based on the characterization of an off-the-shelf sub-100 layers 3D NAND Flash memory product implementing the Triple Level Cell (TLC) paradigm (see Fig. 1). Such technology is considered, based on its endurance and retention rating, as a mass storage medium for enterprise SSD applications. The statistical sample under investigation is composed of all the pages in every physical layer of 40 memory blocks distributed on multiple dies and chips to account for process-induced variability [7]. Since we are testing a TLC memory, we consider all the page types in the analysis (i.e., Least Significant Bit (LSB) page, Center Significant Bit (CSB) page, and Most Significant Bit (MSB) page). Each page is sized 16 Kbytes plus the spare bytes exploited for

1530-4388 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

See https://www.ieee.org/publications/rights/index.html for more information.




Fig. 1. TLC 3D NAND Flash architecture considered in this work [17].

data recovery purposes in case of data corruption. However, state-of-the-art ECC implementations [8] work on a subset of the page, generally referred to as a codeword (CW). In this work, the CW size is 4 Kbytes plus the spare bits, so that every page is constituted by 4 CWs. The statistical sample under test is constituted by 184320 CWs.
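As a quick consistency check, the sample size can be broken down from the figures quoted in the text. The per-block page counts below are derived values rather than numbers stated in the paper, and the even split among page types is an assumption.

```python
TOTAL_CWS = 184320      # codewords in the statistical sample (from the text)
CWS_PER_PAGE = 4        # 16 Kbyte page / 4 Kbyte codeword
BLOCKS = 40             # memory blocks under test
PAGE_TYPES = 3          # LSB, CSB, MSB

pages_total = TOTAL_CWS // CWS_PER_PAGE
pages_per_block = pages_total // BLOCKS
# Assumption: pages are evenly split among the three TLC page types.
pages_per_type_per_block = pages_per_block // PAGE_TYPES
```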

### B. Test Flow for Reliability Assessments

The execution of the test flows required to extract the experimental data to fit with the statistical models presented in the paper is performed by the automated test equipment (ATE) presented in [20]. The system interfaces with the 3D NAND Flash chips at a 400 MT/s data rate and allows, for any applied test, measuring the number of corrupted bits (also known as fail bits count or errors number) in each CW. We recall that a bit is considered corrupted after reading from the 3D NAND Flash under test if its value changes from what has been previously written as a result of a reliability degradation process. Fig. 2 summarizes the test procedure performed. After the definition of the statistical sample under test, we write a random pattern on all the TLC pages (therefore on all the CWs) to rule out any topological dependency of the corrupted bits and perform a readout of the memory content. Then, the devices are submitted to an endurance stress up to the rated endurance of the technology (i.e., maximum number of sustainable block erases before unrecoverable errors) using a JEDEC-based cycling test [21]. The test consists of 3000 Program/Erase cycles at a 61 °C temperature for 500 hours. After the stress we perform a readout of all the CWs under test and we extract the number of corrupted bits for each CW. A read disturb is then performed post-endurance by employing a 1000 uniform block reads access pattern [22] on all tested blocks, followed by another readout for fail bits count extraction. Immediately after the end of the read disturb post-endurance stress, a data retention stress is performed by placing all the devices under test in idle for 90 days at a 40 °C temperature. A double readout is performed at the end of the test to separate the Temporary Read Errors (TRE) effect typical of 3D NAND Flash architectures [23] from the retention stress results. Finally, an additional read disturb post-retention is performed and the fail bits counts per CW are extracted accordingly.



Fig. 2. Depiction of the test flow adopted in this work for the 3D NAND Flash reliability characterization.

### III. 3D NAND FLASH READ DISTURB CHARACTERIZATION


We started the read disturb characterization on our devices by evaluating the Empirical Cumulative Distribution Function (ECDF) of the fail bits count extracted on all the CWs before and after the read disturb stress in the post-endurance and post-retention scenarios, as described in Fig. 2. For a given number of fail bits equal to $t$ in a sample $x$, the ECDF is defined as the proportion of the values in $x$ that are less than or equal to $t$. We prefer to use the term ECDF rather than CDF since the latter can be misleading, as it usually refers to a theoretical probability distribution used to fit the experimental data, which is not our case here. Unfortunately, due to confidentiality reasons on the tested 3D NAND Flash samples, we cannot disclose the ECDF of the raw fail bits count, but we have to normalize the number of corrupted bits in a CW to a defined entity. In this work, we normalized the fail bits count with respect to the ECC capacity offered by an advanced correction engine that also incorporates secondary error correction schemes (e.g., read retry and soft-decoding) [9], [10], [24].
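The ECDF definition and the normalization to the ECC capacity can be sketched in a few lines. The function name, the fail bits counts, and the ECC capacity value below are hypothetical and serve only to illustrate the computation.

```python
from bisect import bisect_right

def normalized_ecdf(fail_bits, ecc_capacity):
    """Return F(t): fraction of codewords whose normalized count is <= t."""
    ratios = sorted(fb / ecc_capacity for fb in fail_bits)
    n = len(ratios)

    def ecdf(t):
        # bisect_right counts how many sorted ratios are <= t
        return bisect_right(ratios, t) / n

    return ecdf

# Hypothetical fail bits counts and ECC capacity, for illustration only.
toy_fail_bits = [10, 20, 20, 35, 50, 80, 120, 200]
F = normalized_ecdf(toy_fail_bits, ecc_capacity=100)
```

With this normalization, `F(1.0)` gives the fraction of codewords whose fail bits count does not exceed the ECC capacity.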

Fig. 3 shows that after endurance stress the read disturb generally increases the number of errors, as expected from other studies in this context [25], [26]. This behavior is related to an over-programming of the memory cells that are not involved in the read operation, due to the moderate voltage applied to unselect them [27]. Looking at the median of the ECDFs per page type, we observe that the CWs belonging to MSB pages are those displaying the largest errors increase, even if there are some CWs in LSB pages (those in the associated ECDF tail) that are largely affected. However, the read disturb post-endurance is not particularly detrimental for the reliability since the ratio fail bits count/ECC capacity is well below one and displays a sufficient margin for safe operation without data corruption.

Fig. 4 shows the results of the same analysis replicated for the retention domain. In the errors analysis we also reported the ECDFs retrieved on the first readout of the memory blocks under test to evaluate the impact of the TRE, as discussed in Section II of this work. Interestingly, we observe that the read disturb applied post-retention can recover part of the errors in a CW for all TLC page types. This has been explained in [26] by a charge redistribution mechanism occurring during



Fig. 3. ECDFs per TLC page type of the fail bits count normalized with respect to the ECC capacity in post-endurance stress and after the application of the read disturb. The ECDFs are extracted from all tested 3D NAND Flash CWs.



Fig. 4. ECDFs per TLC page type of the fail bits count normalized with respect to the ECC capacity in post-retention stress and after the application of the read disturb. The effect of the TRE is also evidenced in the figure.

the read operation. From the reliability standpoint, we note that the retention scenario is the most critical to address since the ratio fail bits count/ECC capacity can be greater than one (especially for MSB pages), thus hampering the data recovery operations.

Besides the errors' distribution characterization according to the TLC page type, we analyzed the contribution of the topological position (i.e., the layer position in a 3D NAND Flash block) to the fail bits count after the application of the read disturb. Fig. 5 shows the results of this investigation for the read disturb post-retention test case. We focus on this stress condition since it is the one triggering the highest number of errors. The error count varies among layers in a single 3D NAND Flash block after the post-retention read disturb and, on top of this, there is a large error



Fig. 5. Fail bits count normalized with respect to the ECC capacity characteristics per TLC page type as a function of layer position in a 3D NAND Flash block after read disturb post-retention. The ECC limit is highlighted for clarity with the black dashed line.

variation between LSB pages and CSB/MSB pages. This is a critical aspect that should be tackled by a statistical model developed to capture errors characteristics. Finally, to better understand the role of the read disturb on the errors count and therefore on the memory reliability, we have calculated the error amplification (EA) factor as:

$$EA(i) = \frac{ERD(i)}{EPRE(i)} \tag{1}$$

where *i* is the layer position in the block from 0 to N - 1 with N the number of layers exploited in the manufacturing of the 3D NAND Flash chip, *ERD* is the number of errors post-read disturb, and *EPRE* is the number of errors pre-read disturb. The EA is calculated both for endurance and retention test cases (see Fig. 6). Even if the endurance case is the one showing the largest EA, we must report once again that such a scenario is not critical for the memory reliability since the errors amount always remains well below the ECC limits. On the contrary, even if the EA is below unity for post-retention read disturb, there are some critical conditions in which the errors count is higher than the ECC capacity and that are worth modeling for future reliability considerations.
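A minimal sketch of the EA computation per layer is given below. The function name and the per-layer error counts are hypothetical; only the ratio itself follows (1).

```python
def error_amplification(errors_post_rd, errors_pre_rd):
    """Per-layer EA(i) = ERD(i) / EPRE(i), as in Eq. (1)."""
    if len(errors_post_rd) != len(errors_pre_rd):
        raise ValueError("both lists must have one entry per layer")
    return [post / pre for post, pre in zip(errors_post_rd, errors_pre_rd)]

# Hypothetical per-layer error counts (layer 0 .. N-1), not measured data.
pre_read_disturb = [40, 50, 80, 100]
post_read_disturb = [60, 50, 72, 90]
ea = error_amplification(post_read_disturb, pre_read_disturb)
```

Values above one indicate layers where the read disturb adds errors, whereas values below one indicate a partial recovery such as the one observed post-retention.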

### IV. STANDARD STATISTICAL MODELING APPROACH

The common procedure adopted for modeling the fail bits count, and in general for all the parametric statistical frameworks applied to 3D NAND Flash data, is to fit the entire ECDF retrieved in a specific working condition. For our study case this is after a read disturb stress performed after endurance or after data retention at high temperature. The advantage of such a parametric approach is a rapid estimation of the ECC capacity needed to cover errors, and it has proven to be useful in many cases. The statistical modeling of the CWs fail bits count distribution is based on the assumption that the corrupted bits in a CW are treated as independent events. By considering a CW length of *n* bytes, it is possible to calculate



Fig. 6. EA factor for read disturb calculated as a function of the layer position in the memory block in endurance (a) and retention (b) test cases.

the probability of having $k$ errors in the CW exhibiting a BER $p$ using the binomial distribution probability mass function as in [4]:

$$y = P_{error}(k|n, p) = \binom{n}{k} p^k (1-p)^{n-k} \tag{2}$$
However, considering that in 3D NAND Flash technology $n$ (equal to 4 Kbytes plus spare bits in our work) is relatively larger than $k$ and $p$ is usually lower than $2 \times 10^{-2}$, the binomial approach starts to fail. Some alternative distributions like the beta-binomial [28], the Gamma [29], the Gamma-Poisson compound [6], or the Weibull [30] have been considered in the literature due to their capability of accounting for the intrinsic variability of the memory technology. The easiest to calculate with software tools for numerical analysis are the Gamma and the Weibull distributions. The former is based on the following probability density function:

$$y = P_{error}(\lambda | \alpha, \beta) = \frac{\lambda^{\alpha - 1}}{\Gamma(\alpha)\beta^{\alpha}} e^{-\left(\frac{\lambda}{\beta}\right)} \tag{3}$$
where $\lambda$ is calculated as $n \cdot p$, $\alpha$ is the shape factor, and $\beta$ is the scale factor of the distribution. The latter is:

$$y = P_{error}(k|\alpha,\beta) = \frac{\beta}{\alpha} \left(\frac{k}{\alpha}\right)^{\beta-1} e^{-\left(\frac{k}{\alpha}\right)^{\beta}} \tag{4}$$
with $\alpha$ and $\beta$ being the scale and the shape factor of the distribution, respectively.
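The two densities in (3) and (4) can be written directly from their formulas, as in the following sketch. The function names are hypothetical; the parameter roles follow the form of the equations, where in (4) the variable $k$ is divided by $\alpha$ and raised to the power $\beta$. Both densities reduce to the same exponential density when the exponent-type parameter equals one, which gives a simple sanity check.

```python
import math

def gamma_pdf(lam, alpha, beta):
    """Gamma density of Eq. (3): alpha is the shape, beta the scale."""
    return lam ** (alpha - 1) * math.exp(-lam / beta) / (math.gamma(alpha) * beta ** alpha)

def weibull_pdf(k, alpha, beta):
    """Weibull density of Eq. (4), with k divided by alpha and raised to beta."""
    z = k / alpha
    return (beta / alpha) * z ** (beta - 1) * math.exp(-(z ** beta))

# Sanity check: both reduce to an exponential density with mean 2.
exp_density = math.exp(-1.0) / 2.0
gamma_value = gamma_pdf(2.0, alpha=1.0, beta=2.0)
weibull_value = weibull_pdf(2.0, alpha=2.0, beta=1.0)
```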

Unfortunately, due to the extreme variability characteristic of the fail bits retrieved in different locations of a 3D NAND Flash chip (see Fig. 5), neither statistical model (Gamma or Weibull) passes the $\chi^2$ goodness-of-fit test (p-value = 0). One may still argue that even if the models do not pass the test, they are still valid for predictions of the ECDF distribution tails, whose practical applications are the selection of the ECC capacity to cover errors or the evaluation of the reliability margin. To this extent, we ran a cross-validation test using a Holdout methodology where 70% of the CWs tested in the experiments are used for training the statistical models and the remaining 30% are used for testing their prediction accuracy. On a total of 1000 cross-validation splits, none passed the goodness-of-fit test. Analyzing the histogram (see Fig. 7) of the fail bits count/ECC capacity and the resulting fits of the statistical models, we evidence an underestimation/overestimation of the empirical data distribution, possibly hampering the selection of the correct ECC capacity or its margin. The Gamma distribution performs better than the Weibull, but still does not pass any statistical test. The same considerations can be drawn by looking at the Cumulative Distribution Function (CDF) modeled by the two statistical models. As in 3D NAND Flash reliability modeling we are mainly interested in the low probability tail of the errors distribution (i.e., extreme events), we need a framework capable of handling only that part of the empirical data.

### V. THE POT METHOD FOR READ DISTURB EVA

### A. Introducing the POT-GPD

Motivated by this, we explored a statistical framework related to EVA [18] that is commonly used in tail data analysis, namely the POT. By considering the fail bits counts of the 3D NAND Flash CWs as a sequence of *i.i.d.* measurements $x_1, x_2, \ldots, x_n$, we can define as extreme events all the CWs that exceed a defined error threshold $u$, for which we can define an exceedance as:

$$\{x_i : x_i \ge u\}. \tag{5}$$

If the exceedances are labeled as $x_{(1)}, \ldots, x_{(k)}$, it is possible to define a threshold excess as:

$$y_j = x_{(j)} - u, \quad j = 1, \ldots, k. \tag{6}$$
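As an illustration of eqs. (5) and (6), the following minimal Python sketch extracts the exceedances and the corresponding threshold excesses from a list of normalized fail bits counts. The function name `threshold_excesses` and the sample values are hypothetical and are not taken from the measured dataset.

```python
def threshold_excesses(samples, u):
    """Return the excesses y_j = x_(j) - u of all samples with x_i >= u."""
    # Exceedances as in eq. (5), sorted so that they can be labeled x_(1..k).
    exceedances = sorted(x for x in samples if x >= u)
    # Threshold excesses as in eq. (6).
    return [x - u for x in exceedances]


# Hypothetical fail bits count/ECC capacity ratios of five CWs.
ratios = [0.5, 1.25, 0.75, 1.0, 1.5]
excesses = threshold_excesses(ratios, u=1.0)
```

With a threshold equal to 1, only the three CWs whose ratio is at least 1 contribute to the excesses.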

From probability theory, it is proven that a random variable $Y_i$ based on the threshold excesses follows a GPD [18]. For a large enough threshold $u$, we can write its probability density function as:

$$f(y) = \sigma^{-1} \left( 1 + \frac{\xi y}{\sigma} \right)^{-1 - \xi^{-1}} \tag{7}$$

with the parameters $\xi$ and $\sigma$ being the distribution shape and scale factors, respectively.
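A minimal Python sketch of the density in eq. (7), together with the corresponding CDF $F(y) = 1 - (1 + \xi y / \sigma)^{-1/\xi}$, may help to visualize the role of the two parameters. The function names are illustrative, and the handling of the $\xi \to 0$ limit (exponential distribution) and of the bounded support for $\xi < 0$ follows the standard GPD definition rather than anything stated explicitly in this work.

```python
import math


def gpd_pdf(y, xi, sigma):
    """GPD density of eq. (7) evaluated at an excess y."""
    if y < 0:
        return 0.0
    if abs(xi) < 1e-12:  # xi -> 0: exponential limit
        return math.exp(-y / sigma) / sigma
    z = 1.0 + xi * y / sigma
    if z <= 0:  # beyond the upper end point when xi < 0
        return 0.0
    return z ** (-1.0 - 1.0 / xi) / sigma


def gpd_cdf(y, xi, sigma):
    """GPD cumulative distribution function evaluated at an excess y."""
    if y <= 0:
        return 0.0
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-y / sigma)
    z = 1.0 + xi * y / sigma
    if z <= 0:
        return 1.0
    return 1.0 - z ** (-1.0 / xi)
```

For a negative shape parameter the support is bounded, so excesses larger than $-\sigma/\xi$ have zero density.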

### B. Threshold Choice

The most critical operation in POT-GPD statistical modeling is the choice of the best threshold to apply. The selection of an optimal threshold within a region of interest (ROI) requires a bias-variance trade-off and knowledge of the ECC capabilities offered in the 3D NAND Flash data recovery processes. If the chosen model threshold is too low, the results are biased because the asymptotic assumption of the model is invalid. In other words, a too low threshold will result in exceedances that do not converge to the GPD, since the probability distribution is based on its capability of fitting extreme events [31]. On the other hand, if the threshold is too high, the variance is large due to the few exceedances. In [32], it is stated that the threshold must be high enough for the exceedances over threshold to converge to the GPD, while the sample size should be large enough to ensure that there are enough data points left for a satisfactory determination of the GPD parameters. Additionally, in [33] it is evidenced that the standard practice when choosing a threshold is to select the lowest threshold possible for which the limit model (i.e., the GPD) provides a reasonable approximation for the exceedances.

Fig. 7. Gamma and Weibull distributions exploited to fit the fail bits count/ECC capacity distribution on all the CWs measured after read disturb post-retention stress.

A method to identify the correct threshold lies in the use of the mean residual life (MRL) plot combined with the stability plots of the GPD parameters. Concerning the former, the locus of points defined as:

$$\left\{ \left( u, \frac{1}{n_u} \sum_{i=1}^{n_u} \left( x_{(i)} - u \right) \right) : u < x_{max} \right\} \tag{8}$$

where $x_{(1)}, \ldots, x_{(n_u)}$ are the $n_u$ CWs exceeding the threshold $u$ and $x_{max}$ is the largest of the $x_i$, should be approximately linear in a ROI of $u$ to define a proper threshold. For the latter, it is important to check whether the estimated GPD parameters are stable (i.e., constant) in the ROI, but after the following transformation of the GPD scale parameter [18]:

$$\sigma^* = \sigma - \xi u. \tag{9}$$
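The two diagnostics above can be sketched in a few lines of Python. The snippet below computes the points of the MRL locus in eq. (8) for a set of candidate thresholds and applies the scale transformation of eq. (9). The function names and the toy samples are hypothetical; in practice the GPD parameters entering eq. (9) would come from a fit repeated at each candidate threshold.

```python
def mean_residual_life(samples, thresholds):
    """Return the (u, mean excess) points of eq. (8) for each threshold u."""
    x_max = max(samples)
    points = []
    for u in thresholds:
        if u >= x_max:  # eq. (8) is defined only for u < x_max
            continue
        excesses = [x - u for x in samples if x >= u]
        points.append((u, sum(excesses) / len(excesses)))
    return points


def modified_scale(sigma, xi, u):
    """Threshold-independent scale parameter of eq. (9)."""
    return sigma - xi * u


# Toy data: four samples and three candidate thresholds (the last is skipped).
mrl_points = mean_residual_life([1.0, 2.0, 3.0, 4.0], [0.0, 2.5, 4.0])
```

A threshold is a reasonable candidate when the MRL points are approximately aligned and the values of $\xi$ and $\sigma^*$ fitted above it do not drift.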

It is also important to check whether the threshold is meaningful for reliability investigations. To this extent, since all our fail bits count data are normalized with respect to the ECC capacity, we decided to set the threshold u = 1. Every exceedance will therefore represent an unrecoverable CW and thus a reliability concern in storage applications. Please note that in our study case we set the ECC capacity to match that offered by state-of-the-art correction engines. Retrospectively, we evaluated that such a choice also grants a good number of exceedances on which to apply the POT-GPD fit. Fig. 8 shows the validation of the threshold choice procedure for read disturb post-retention data. Since this scenario represents a critical case for the reliability, surely more than the read disturb post-endurance as evidenced in the previous sections of the work, we will base all our investigations on this corner.



Fig. 8. (a) Mean residual life plot with a region of interest (ROI) highlighted. (b) and (c) Stability plots of GPD parameters. The CWs data are from post-retention read disturb tests [17].

### C. Extending the EVA With POT-Weibull

The POT approach can be complemented with any probability distribution that can embed the concept of threshold. The GPD has been proven to be one of the best statistical tools for modeling excesses, although it is not the only one. The Weibull distribution can be another viable approach, but not in the form of eq. (4). Indeed, the threshold concept must be included as follows:

$$f(y) = \frac{\beta}{\alpha} \left(\frac{y - u}{\alpha}\right)^{\beta - 1} e^{-\left(\frac{y - u}{\alpha}\right)^{\beta}} \tag{10}$$

where $\alpha$ and $\beta$ are the same parameters defined in eq. (4), and $u$ is the threshold defined in the previous section. This distribution is also referred to as a three-parameter Weibull [34], which we will use as a benchmark for the GPD.
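For reference, a minimal Python sketch of the three-parameter Weibull of eq. (10) and of its CDF $F(y) = 1 - e^{-((y-u)/\alpha)^{\beta}}$ is given below. The function names are illustrative, and returning a zero density for $y \le u$ is a simplifying assumption of this sketch (the density at $y = u$ is not finite when $\beta < 1$).

```python
import math


def pot_weibull_pdf(y, u, alpha, beta):
    """Three-parameter Weibull density of eq. (10) with threshold u."""
    if y <= u:
        return 0.0
    z = (y - u) / alpha
    return (beta / alpha) * z ** (beta - 1.0) * math.exp(-(z ** beta))


def pot_weibull_cdf(y, u, alpha, beta):
    """Three-parameter Weibull cumulative distribution function."""
    if y <= u:
        return 0.0
    return 1.0 - math.exp(-(((y - u) / alpha) ** beta))
```

The threshold simply shifts the origin of the distribution, so that only values above $u$ carry probability mass.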



Fig. 9. (a) PDF and (b) CDF of the POT-GPD and POT-Weibull distributions devised in the modeling of the read disturb post-retention CW exceedances over threshold.

TABLE I
POT-GPD AND POT-WEIBULL PARAMETERS ESTIMATE ON READ DISTURB POST-RETENTION DATA USING THRESHOLD u = 1

|                           | POT-GPD | POT-Weibull |
|---------------------------|---------|-------------|
| $\hat{\xi}$               | -0.13   | -           |
| $\hat{\sigma}$            | 0.07    | -           |
| $\hat{\alpha}$            | -       | 0.07        |
| $\hat{\beta}$             | -       | 1.15        |
| p-value ( $\chi^2$ -test) | 0.29    | 0.21        |

### VI. ESTIMATING DIE-LEVEL RELIABILITY

### A. Fitting Process and Model Cross-Validation

After the threshold choice process for read disturb post-retention data, we evaluated the capability of the POT-GPD and POT-Weibull models to fit the exceedances of the CWs error distribution. In Fig. 9, we show that both the probability density function (PDF) and the CDF obtained through maximum likelihood estimation (MLE) fit the experimental data well for both models. The estimated GPD parameters $\hat{\xi}$ and $\hat{\sigma}$ and the Weibull parameters $\hat{\alpha}$ and $\hat{\beta}$ are reported in Table I. We ran a $\chi^2$ goodness-of-fit test at the 0.05 significance level to check that the exceedances can be described with both POT approaches. The test passed with a p-value = 0.29 for POT-GPD and with a p-value = 0.21 for POT-Weibull (the higher the p-value the better it is), so there is no evidence to discard this statistical hypothesis.
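To make the MLE step concrete, the following Python sketch fits the GPD of eq. (7) to a set of excesses by minimizing the negative log-likelihood over a coarse parameter grid. This is only an illustration under stated assumptions: the optimizer actually used for the estimates in Table I is not described here, the grid search is a simple stand-in for it, and the excesses are synthetic values drawn from a GPD with the Table I parameters rather than measured data. All function names are hypothetical.

```python
import math
import random


def gpd_neg_log_likelihood(excesses, xi, sigma):
    """Negative log-likelihood of the GPD; inf outside the support."""
    if sigma <= 0:
        return math.inf
    nll = len(excesses) * math.log(sigma)
    for y in excesses:
        if abs(xi) < 1e-9:  # exponential limit for xi -> 0
            nll += y / sigma
            continue
        z = 1.0 + xi * y / sigma
        if z <= 0:
            return math.inf
        nll += (1.0 + 1.0 / xi) * math.log(z)
    return nll


def fit_gpd_grid(excesses, xi_grid, sigma_grid):
    """Return the (xi, sigma) grid point with the lowest negative log-likelihood."""
    best = (math.inf, None, None)
    for xi in xi_grid:
        for sigma in sigma_grid:
            nll = gpd_neg_log_likelihood(excesses, xi, sigma)
            if nll < best[0]:
                best = (nll, xi, sigma)
    return best[1], best[2]


def sample_gpd(n, xi, sigma, rng):
    """Draw n synthetic excesses by inverting the GPD CDF (xi != 0)."""
    return [sigma / xi * ((1.0 - rng.random()) ** (-xi) - 1.0) for _ in range(n)]


rng = random.Random(0)
synthetic = sample_gpd(400, xi=-0.13, sigma=0.07, rng=rng)  # not measured data
xi_grid = [-0.5 + 0.02 * i for i in range(51)]
sigma_grid = [0.01 + 0.005 * i for i in range(40)]
xi_hat, sigma_hat = fit_gpd_grid(synthetic, xi_grid, sigma_grid)
```

A gradient-based or simplex optimizer would normally replace the grid, but the likelihood being maximized is the same.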

### B. Calculating the POT Return Level

We put the POT models to work to predict the die-level reliability of a 3D NAND Flash chip in the read disturb post-retention case measured by our experimental setup. This process can serve as an example for system designers who require a fast evaluation of the technological capabilities from measurements.

The POT-GPD method enables such reliability assessment through the return level evaluation [31], [33]. The return level and the return period are two important concepts in the POT theory, thus requiring proper introduction. If we define a return period *N* of a CW that is measured in quantity of 3D NAND Flash memory blocks, the return level, *x*, is the threshold that is exceeded in one memory block with probability $\frac{1}{N}$. This is equivalent to claiming that the return level *x* is exceeded on average once in *N* blocks. As an example, a CW with a fail bits count/ECC capacity ratio equal to 1.1 has a return period of 3 blocks if and only if the probability of observing a CW whose fail bits count/ECC capacity ratio is higher than 1.1 in a block is $\frac{1}{3}$. In the POT theory, the return period is calculated as:

$$N = \underbrace{\frac{\text{number of CWs exceeding the threshold}}{\text{total number of CWs measured in blocks}} \times m}_{\text{expected number of exceedances in } m \text{ blocks}} \tag{11}$$

From the previous equation, it follows that *N* is the number of events over threshold between the occurrence of two consecutive CWs, both with a return period of *m* blocks. Hence, $\frac{1}{N}$ (i.e., the return level) is the probability of observing a CW with a return period of *m* blocks in one block. If we choose the CDF *F*(*x*) of a specified probability distribution (i.e., the GPD or the Weibull used in the POT method) and *F*(*x*) = $1 - \frac{1}{N}$, then *F*(*x*) is the probability of observing any CW with a fail bits count/ECC capacity ratio less than or equal to *x* in one block.

Starting from that, we assumed that a 3D NAND Flash die is composed of 3000 blocks and then we estimated the return level per block in the case of the GPD distribution as:

$$x_m = u + \frac{\hat{\sigma}}{\hat{\xi}} \left[ \left( m \hat{\zeta}_u \right)^{\hat{\xi}} - 1 \right] \tag{12}$$

where *m* is the number of blocks, $\hat{\zeta}_u$ is the probability to have an exceedance when a threshold *u* is considered, and $\hat{\sigma}$ and $\hat{\xi}$ are the estimated parameters of the GPD distribution. In the case of a Weibull distribution the previous return level equation becomes:

$$x_m = u + \hat{\alpha} \left[ \log \left( m \hat{\zeta}_u \right) \right]^{1/\hat{\beta}} \tag{13}$$

where $\hat{\alpha}$ and $\hat{\beta}$ are the estimated parameters of the three-parameter Weibull distribution.
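Eqs. (12) and (13) can be evaluated directly once the model parameters are known. The Python sketch below uses the parameter estimates of Table I and the assumed die size of 3000 blocks. The exceedance probability `zeta_u = 0.5` is a purely hypothetical placeholder, since its measured value is not reported in the text, and the function names are illustrative.

```python
import math


def gpd_return_level(m, u, xi, sigma, zeta_u):
    """Return level of eq. (12) for m blocks under the POT-GPD model."""
    if abs(xi) < 1e-9:  # xi -> 0 limit of eq. (12)
        return u + sigma * math.log(m * zeta_u)
    return u + (sigma / xi) * ((m * zeta_u) ** xi - 1.0)


def weibull_return_level(m, u, alpha, beta, zeta_u):
    """Return level of eq. (13); requires m * zeta_u > 1."""
    return u + alpha * math.log(m * zeta_u) ** (1.0 / beta)


u = 1.0          # threshold used in this work
zeta_u = 0.5     # hypothetical exceedance probability (assumption)
m = 3000         # blocks assumed per die

x_gpd = gpd_return_level(m, u, xi=-0.13, sigma=0.07, zeta_u=zeta_u)
x_weibull = weibull_return_level(m, u, alpha=0.07, beta=1.15, zeta_u=zeta_u)
```

With these inputs the POT-GPD return level is lower than the POT-Weibull one, and, since $\hat{\xi} < 0$, it can never exceed the finite end point $u - \hat{\sigma}/\hat{\xi}$.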

The results in Fig. 10 for read disturb post-retention measurements show the return level to be expected for 3000 blocks, also considering the 95% confidence interval for the GPD and Weibull distribution parameter estimates. From the return level analysis, we infer two results: i) the empirical data fall out of the POT-Weibull return level confidence interval for some points; ii) for a high number of blocks, the POT-GPD provides an optimistic estimation of the return level with respect to the POT-Weibull (lower fail bits count/ECC capacity ratio) while providing a larger return level confidence interval. We also ran a cross-validation test using a Holdout methodology where 70% of the CWs tested in the experiments are used for training the statistical models and the remaining 30% are used for testing the POT models. On a total of 1000 cross-validation splits (see Fig. 11), we report that the median p-value of the POT-GPD approach is slightly higher than that of the POT-Weibull, justifying the better prediction capabilities of the former model.
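The Holdout procedure can be sketched as follows in Python. The snippet only shows how the 70%/30% partitions could be generated for 1000 splits; the fit on the training part and the goodness-of-fit test on the remaining part are left out. The function name, the random seed, and the placeholder data are assumptions made for illustration.

```python
import random


def holdout_split(samples, train_fraction, rng):
    """Randomly partition the samples into a training and a testing set."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
cws = list(range(100))  # placeholder for the measured CW ratios
splits = [holdout_split(cws, 0.7, rng) for _ in range(1000)]
train, test = splits[0]
```

Each split would then yield one p-value per model, and the distribution of those p-values over the splits is what a boxplot summarizes.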



Fig. 10. Return level estimate for read disturb post-endurance of the POT-GPD and the POT-Weibull model with 95% confidence interval.





Fig. 11. Boxplot of the cross-validation splits performed for POT-GPD and POT-Weibull methods.

All these results clearly indicate that in mass storage applications like SSDs and MMCs, where many blocks are considered for high storage capacity, advanced protection must be ensured, since it is highly probable to encounter an unfavorable situation (i.e., unrecoverable errors) in some of the blocks constituting the memory die. This requires additional effort at system level to mitigate the 3D NAND Flash error probability.

### C. Bootstrapping the POT Estimates

Since the POT-GPD and the POT-Weibull parameters are obtained through an MLE process, we had to retrieve their confidence interval through a bootstrap analysis of the parameters with 1000 replicas of the exceedances' dataset. A single bootstrap replica is a random sample of size $n_u$ defined as $(x_1^*, x_2^*, \dots, x_{n_u}^*)$ drawn with replacement from the exceedances population of $n_u$ samples retrieved with the procedure described in the former section of this work. In this data set, some of the original samples appear zero times, while others appear once or multiple times. Fig. 12 shows a quantile-quantile plot proving a normal distribution of the POT-GPD parameters, from which it is easy to extract the confidence interval. Similar results (not shown) are achieved for the POT-Weibull. Nevertheless, we must report that this procedure has some issues in the lower quantiles of the normal distribution. This is ascribed to the MLE process convergence to a boundary point of the parameters space for some bootstrap samples.

Fig. 12. Bootstrap simulation on the GPD parameters by resampling 1000 times the exceedances CW in the read disturb post-retention dataset [17].

Fig. 13. Standard deviation in return level estimates depending on the chosen resampling technique. Solid lines are read disturb post-endurance data whereas dashed lines are post-retention.
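The bootstrap procedure can be summarized by the following Python sketch. To keep it short, the statistic recomputed on each replica is the mean excess, used here only as a stand-in for the MLE of the POT parameters; the exceedances are synthetic, and all names are hypothetical. The confidence interval is derived under the normality assumption supported by the quantile-quantile plot.

```python
import random
import statistics


def bootstrap_replicas(exceedances, n_replicas, rng):
    """Draw n_replicas samples of size n_u with replacement."""
    n_u = len(exceedances)
    return [rng.choices(exceedances, k=n_u) for _ in range(n_replicas)]


def normal_confidence_interval(estimates, z=1.96):
    """95% interval assuming normally distributed estimates."""
    center = statistics.mean(estimates)
    spread = statistics.stdev(estimates)
    return center - z * spread, center + z * spread


rng = random.Random(1)
u = 1.0
# Synthetic exceedances (assumption): threshold plus exponential excesses.
exceedances = [u + rng.expovariate(1 / 0.07) for _ in range(200)]
replicas = bootstrap_replicas(exceedances, 1000, rng)
# Stand-in statistic: mean excess of each replica instead of an MLE fit.
estimates = [statistics.mean(r) - u for r in replicas]
ci_low, ci_high = normal_confidence_interval(estimates)
```

Replacing the mean excess with the fitted $\hat{\xi}$ and $\hat{\sigma}$ of each replica gives the parameter distributions on which the interval is computed.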

Finally, we tried another resampling technique to check if we would achieve consistent results in the POT-GPD and POT-Weibull parameters estimation, namely the Jack-knife resampling. In this technique, given the original dataset of $n_u$ exceedances, the $i$-th jack-knife sample is defined as:

$$x_{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{n_u}), \quad i = 1, \ldots, n_u. \tag{14}$$

A calculation of this method has been performed with a commercial tool for matrix data manipulation. To compare the prediction accuracy of both resampling techniques, we plotted the standard deviation of the return level predictions as a function of the return level calculated with (12). As we can see in Fig. 13, the jack-knife resampling provides the smallest standard deviation for the estimates (there is a difference of up to 40 times at die level prediction) performed for read disturb post-retention. This result is attributed to the small variation of the newly generated samples (the replica datasets differ by a single value). To this extent, the Jack-knife resampling technique is not well suited to be used together with the POT approach, since the generated replicas are not so different; hence, estimations based on these samples differ slightly and could lead to optimistic predictions.
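The effect described above can be reproduced with a short Python sketch that builds the leave-one-out samples of eq. (14) and compares the spread of a statistic over jack-knife and bootstrap replicas. As a simplifying assumption, the statistic is the sample mean rather than a return level, and the exceedances are synthetic; the names used are hypothetical.

```python
import random
import statistics


def jackknife_samples(exceedances):
    """Return the n_u leave-one-out samples of eq. (14)."""
    return [exceedances[:i] + exceedances[i + 1:] for i in range(len(exceedances))]


rng = random.Random(2)
# Synthetic exceedances (assumption): threshold 1 plus exponential excesses.
exceedances = [1.0 + rng.expovariate(1 / 0.07) for _ in range(100)]

jk_means = [statistics.mean(s) for s in jackknife_samples(exceedances)]
bs_means = [
    statistics.mean(rng.choices(exceedances, k=len(exceedances)))
    for _ in range(1000)
]

jk_spread = statistics.pstdev(jk_means)  # replicas differ by a single value
bs_spread = statistics.pstdev(bs_means)  # replicas are resampled freely
```

Because each jack-knife sample differs from the others by one element only, the raw spread of its estimates is much smaller than the bootstrap one, which is the origin of the optimistic predictions.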

### VII. CONCLUSION

In this work, we validated the POT methodology as a technique for EVA to be applied on a study case like the read disturb reliability modeling in 3D NAND Flash memories. The effectiveness of the model proved its applicability in die-level reliability predictions of the number of errors per CW in an important scenario like the post-retention use case. The methodology could be beneficial for storage system level designers dealing with error mitigation schemes. In the future, we plan to apply the methodology to other 3D NAND Flash reliability threats and to model extreme events in SSD platforms studied at architectural level.

### REFERENCES

- [1] R. Micheloni, S. Aritome, and L. Crippa, "Array architectures for 3-D NAND flash memories," Proc. IEEE, vol. 105, no. 9, pp. 1634-1649, Sep. 2017, doi: 10.1109/JPROC.2017.2697000.
- [2] A. S. Spinelli, C. M. Compagnoni, and A. L. Lacaita, "Reliability of NAND flash memories: Planar cells and emerging issues in 3D devices," Computers, vol. 6, no. 2, p. 16, 2017, doi: 10.3390/computers6020016.
- [3] T. A. Marquart, "Solid-state-drive qualification and reliability strategy," in Proc. IEEE Int. Integr. Rel. Workshop (IIRW), South Lake Tahoe, CA, USA, Oct. 2015, pp. 3-6, doi: 10.1109/IIRW.2015.7437056.
- [4] N. Mielke et al., "Bit error rate in NAND flash memories," in Proc. IEEE Int. Rel. Phys. Symp., Phoenix, AZ, USA, Apr. 2008, pp. 9-19, doi: 10.1109/RELPHY.2008.4558857.
- [5] T. Parnell, N. Papandreou, T. Mittelholzer, and H. Pozidis, "Modelling of the threshold voltage distributions of sub-20nm NAND flash memory," in Proc. IEEE Global Commun. Conf., Austin, TX, USA, Dec. 2014, pp. 2351-2356, doi: 10.1109/GLOCOM.2014.7037159.
- [6] N.-J. Wang et al., "Statistical analysis of bit-errors distribution for reliability of 3-D NAND flash memories," in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Dallas, TX, USA, Apr. 2020, pp. 1-5, doi: 10.1109/IRPS45951.2020.9128993.
- [7] C. Zambelli, R. Micheloni, and P. Olivo, "Reliability challenges in 3D NAND flash memories," in Proc. IEEE 11th Int. Memory Workshop (IMW), Monterey, CA, USA, May 2019, pp. 1-4, doi: 10.1109/IMW.2019.8739741.
- [8] L. Zuolo, C. Zambelli, R. Micheloni, and P. Olivo, "Solid-state drives: Memory driven design methodologies for optimal performance," Proc. IEEE, vol. 105, no. 9, pp. 1589-1608, Sep. 2017, doi: 10.1109/JPROC.2017.2733621.
- [9] T. Zhang, Using LDPC Codes in SSD-Challenges and Solutions, Flash Memory Summit, Santa Clara, CA, USA, Aug. 2012.
- [10] E. F. Haratsch, LDPC Code Concepts and Performance on High-Density Flash Memory, Flash Memory Summit, Santa Clara, CA, USA, Aug. 2014.
- [11] N. R. Mielke, R. E. Frickey, I. Kalastirsky, M. Quan, D. Ustinov, and V. J. Vasudevan, "Reliability of solid-state drives based on NAND flash memory," Proc. IEEE, vol. 105, no. 9, pp. 1725-1750, Sep. 2017, doi: 10.1109/JPROC.2017.2725738.
- [12] S. Im and D. Shin, "Flash-aware RAID techniques for dependable and high-performance flash memory SSD," IEEE Trans. Comput., vol. 60, no. 1, pp. 80-92, Jan. 2011, doi: 10.1109/TC.2010.197.
- [13] J. Kim, E. Lee, J. Choi, D. Lee, and S. H. Noh, "Chip-level RAID with flexible stripe size and parity placement for enhanced SSD reliability," IEEE Trans. Comput., vol. 65, no. 4, pp. 1116-1130, Apr. 2016, doi: 10.1109/TC.2014.2375179.

- [14] Y. Li, P. P. C. Lee, and J. C. S. Lui, "Analysis of reliability dynamics of SSD RAID," IEEE Trans. Comput., vol. 65, no. 4, pp. 1131-1144, Apr. 2016, doi: 10.1109/TC.2014.2349505.
- [15] C. Zambelli, A. Marelli, R. Micheloni, and P. Olivo, "Modeling the endurance reliability of intradisk RAID solutions for mid-1X TLC NAND flash solid-state drives," IEEE Trans. Device Mater. Rel., vol. 17, no. 4, pp. 713-721, Dec. 2017, doi: 10.1109/TDMR.2017.2749639.
- [16] A. Grossi, L. Zuolo, F. Restuccia, C. Zambelli, and P. Olivo, "Quality-of-service implications of enhanced program algorithms for charge-trapping NAND in future solid-state drives," IEEE Trans. Device Mater. Rel., vol. 15, no. 3, pp. 363-369, Sep. 2015, doi: 10.1109/TDMR.2015.2448108.
- [17] C. Zambelli, L. Crippa, R. Micheloni, and P. Olivo, "Points-over-threshold statistics for post-retention read disturb reliability in 3D NAND flash," in Proc. IEEE Int. Integr. Rel. Workshop (IIRW), South Lake Tahoe, CA, USA, 2020, pp. 1-5.
- [18] M. R. Leadbetter, "On a basis for 'peaks over threshold' modeling," Stat. Probab. Lett., vol. 12, no. 4, pp. 357-362, 1991, doi: 10.1016/0167-7152(91)90107-3.
- [19] K. Ha, J. Jeong, and J. Kim, "A read-disturb management technique for high-density NAND flash memory," in Proc. 4th Asia-Pac. Workshop Syst., 2013, pp. 1-6, doi: 10.1145/2500727.2500743.
- [20] C. Zambelli et al., "Characterization of TLC 3D-NAND flash endurance through machine learning for LDPC code rate optimization," in Proc. IEEE Int. Memory Workshop (IMW), Monterey, CA, USA, 2017, pp. 1-4, doi: 10.1109/IMW.2017.7939074.
- [21] Electrically Erasable Programmable Rom (EEPROM) Program/Erase Endurance and Data Retention Test, document JESD22-A117, JEDEC, Arlington, VA, USA, Oct. 2018.
- [22] C. Zambelli, P. Olivo, L. Crippa, A. Marelli, and R. Micheloni, "Uniform and concentrated read disturb effects in mid-1X TLC NAND flash memories for enterprise solid state drives," in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Monterey, CA, USA, 2017, pp. 1-4, doi: 10.1109/IRPS.2017.7936387.
- [23] C. Zambelli, R. Micheloni, S. Scommegna, and P. Olivo, "First evidence of temporary read errors in TLC 3D-NAND flash memories exiting from an idle state," IEEE J. Electron Devices Soc., vol. 8, pp. 99-104, 2020, doi: 10.1109/JEDS.2020.2965648.
- [24] Microsemi. (2019). PM8609 NVMe2032 Flashtec NVMe Controller. [Online]. Available: https://www.microsemi.com/product-directory/storage-ics/3687-flashtec-nvme-controllers
- [25] N. Papandreou et al., "Characterization and analysis of bit errors in 3D TLC NAND flash memory," in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Monterey, CA, USA, 2019, pp. 1-6, doi: 10.1109/IRPS.2019.8720454.
- [26] F. Wang et al., "Lateral charge migration induced abnormal read disturb in 3D charge-trapping NAND flash memory," Appl. Phys. Exp., vol. 13, no. 5, Apr. 2020, Art. no. 054002, doi: 10.35848/1882-0786/ab8729.
- [27] Y. Cai, Y. Luo, S. Ghose, and O. Mutlu, "Read disturb errors in MLC NAND flash memory: Characterization, mitigation, and recovery," in Proc. IEEE/IFIP Int. Conf. Depend. Syst. Netw., Rio de Janeiro, Brazil, Jun. 2015, pp. 438-449, doi: 10.1109/DSN.2015.49.
- [28] V. Taranalli, H. Uchikawa, and P. H. Siegel, "On the capacity of the beta-binomial channel model for multi-level cell flash memories," IEEE J. Sel. Areas Commun., vol. 34, no. 9, pp. 2312-2324, Sep. 2016, doi: 10.1109/JSAC.2016.2603660.
- [29] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, "Improving 3D NAND flash memory lifetime by tolerating early retention loss and process variation," 2018. [Online]. Available: arXiv:1807.05140
- [30] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, "Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling," in Proc. Design Autom. Test Eur. Conf. Exhibit. (DATE), Grenoble, France, 2013, pp. 1285-1290, doi: 10.7873/DATE.2013.266.
- [31] C. Stander, "Analysis of extreme events in the coastal engineering environment," M.S. thesis, Dept. Appl. Math., Stellenbosch Univ., Stellenbosch, South Africa, Dec. 2015.
- [32] N. Teena, V. S. Kumar, K. Sudheesh, and R. Sajeev, "Statistical analysis on extreme wave height," Nat. Hazards J. Int. Soc. Prevent. Mitigation Nat. Hazards, vol. 64, no. 1, pp. 223-236, Oct. 2012, doi: 10.1007/s11069-012-0229-y.
- [33] S. Coles, An Introduction to Statistical Modeling of Extreme Values. London, U.K.: Springer, 2001.
- [34] E. Merran, N. Hastings, and B. Peacock, Statistical Distributions, 2nd ed. New York, NY, USA: Wiley, 1993.