You may have heard about ECC (Error-Correcting Code) memory and thought it was just some fancy thing for servers, not something you’d need on your desktop PC. But in my opinion, modern computers used for anything more mission-critical than video games need ECC memory. Without it, you’re asking for serious problems from the contents of RAM randomly changing due to stray radiation and electrical noise that can and will flip bits here and there.
Why ECC Didn’t Use To Be As Important
Back in the days when “640KB RAM was all that anybody would ever need”, sporadic memory errors weren’t a big problem for two major reasons. First, the feature size of the circuits on the chip was large enough that a stray particle of radiation flying through a memory cell wasn’t all that likely to flip its state. Second, there just weren’t that many memory cells in a typical computer. If the odds of any one cell getting flipped were something like one in a billion per year of operation, you could expect to run that 640KB of RAM for many decades without an error. For a desktop PC, that’s more than good enough.
But that was then. Today, the story is much different. The RAM found in the typical PC today is around 10,000 times that of the MS-DOS era PC. Memory cell sizes have shrunk dramatically, making the cells more susceptible to stray radiation randomly flipping a bit. The memory also runs at much higher speeds, meaning that a poor connection or transient electrical noise is that much more likely to cause a bit read or write error.
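To put rough numbers on that comparison, here is a minimal sketch in Python. It assumes the "one in a billion per year" figure above is a per-cell probability, which is only an illustration and not a measured rate, and it holds that rate constant even though smaller cells are likely more susceptible. The function and constant names are made up for this example.

```python
# Hypothetical per-bit flip probability per year, taken from the
# "one in a billion" illustration in the text (not a measured rate).
FLIP_PROB_PER_BIT_PER_YEAR = 1e-9


def expected_errors_per_year(ram_bytes, p=FLIP_PROB_PER_BIT_PER_YEAR):
    """Expected number of bit flips per year for a given amount of RAM."""
    return ram_bytes * 8 * p


def mean_years_between_errors(ram_bytes, p=FLIP_PROB_PER_BIT_PER_YEAR):
    """Average time between bit flips, in years."""
    return 1.0 / expected_errors_per_year(ram_bytes, p)


dos_era_bytes = 640 * 1024              # 640KB
modern_bytes = dos_era_bytes * 10_000   # "around 10,000 times" as much RAM

dos_years = mean_years_between_errors(dos_era_bytes)
modern_days = mean_years_between_errors(modern_bytes) * 365

print(f"640KB:   one flip every {dos_years:.0f} years or so")
print(f"10,000x: one flip every {modern_days:.1f} days or so")
```

Under those assumptions the 640KB machine goes roughly 190 years between flips, while the machine with 10,000 times the memory sees one about every week, purely because it has more cells.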
SDRAM Errors Far More Common Than Previously Thought
Engineers developing mission-critical computer systems that must run 24/7 favor ECC memory for such applications. But exactly how often do memory cells flip and return the wrong value? The answer to this question is not straightforward. The statistics on how often memory errors occur are not well known and at times seem contradictory.
A 1998 article published in EE Times suggested that each 256MB of RAM generates about one bit error per week. Given that most desktop PCs of that time had less memory than that, you might have expected your PC with 128MB of RAM to run a couple of weeks between memory errors.
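The 128MB estimate is just the EE Times figure scaled by memory size. A quick sketch, assuming the error rate grows linearly with the amount of RAM (my simplification, not something the article is quoted as saying):

```python
# Reference rate from the 1998 EE Times figure cited in the text:
# about one bit error per week for each 256MB of RAM.
REFERENCE_MB = 256
REFERENCE_ERRORS_PER_WEEK = 1.0


def weeks_between_errors(ram_mb):
    """Average weeks between bit errors, assuming the rate scales
    linearly with the amount of RAM (an assumption for illustration)."""
    errors_per_week = REFERENCE_ERRORS_PER_WEEK * ram_mb / REFERENCE_MB
    return 1.0 / errors_per_week


for ram_mb in (64, 128, 256, 512):
    print(f"{ram_mb}MB: about {weeks_between_errors(ram_mb):g} weeks between errors")
```

Halving the memory doubles the expected time between errors, which is where "a couple of weeks" for a 128MB machine comes from.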
The educated guess of the early 2000s was that SDRAM quality and computer design had improved, and that you didn’t have to worry much about memory errors with less than 1GB of RAM. But as the amount of RAM went up, the need for ECC did, too. Any serious server platform had support for ECC memory by around that time.
In 2009, Google published a study that looked at ECC memory controller reports of errors in their server farms. The Google memory error study found that the then-current industry estimates of memory error rates were about 15 to 1000 times too low. Correctable memory errors showed up in about one third of servers each year. In the servers where they occurred, the error rate averaged around 4000 errors per year. Yikes!
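To get a feel for those two numbers, here is a back-of-the-envelope sketch that uses only the figures quoted above. Treat the results as rough averages: errors tend to cluster on a few bad machines, so a typical affected server may look quite different from the mean.

```python
# Figures as quoted in the text from the 2009 Google study.
FRACTION_OF_SERVERS_AFFECTED = 1 / 3
ERRORS_PER_AFFECTED_SERVER_PER_YEAR = 4000

# Average correctable errors per day on a server that has any errors at all.
errors_per_day_affected = ERRORS_PER_AFFECTED_SERVER_PER_YEAR / 365

# Average over the whole fleet, counting the error-free servers as zero.
fleet_average_per_server = (
    FRACTION_OF_SERVERS_AFFECTED * ERRORS_PER_AFFECTED_SERVER_PER_YEAR
)

print(f"Affected server: about {errors_per_day_affected:.0f} errors per day")
print(f"Fleet-wide: about {fleet_average_per_server:.0f} errors per server per year")
```

That works out to roughly 11 correctable errors a day on an affected server, or about 1,300 per server per year averaged across the fleet.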
The gist of the results is that many memory modules work with very low error rates, but any module that has one error is likely to have errors much more often. Properly implemented ECC helps detect the bad modules without crashing the servers, so they can be replaced at an opportune time. Google uses ECC memory in all its servers, and they have so many servers that their study carries a lot of weight compared to previous attempts to quantify memory error rates.