
Giving life to self-healing chips: preventing hard silicon failures

10/01/2006

In a bid to enhance the performance of microprocessors, the Semiconductor Research Corp. (SRC) recently said it plans to develop self-healing chips: chips that avoid failure by diagnosing when components wear out and healing themselves on the fly. The research on hard silicon failures is planned in conjunction with the National Science Foundation and the U. of Michigan.

Transistor failure (left) in an MP3 player rerouted using self-healing technology (right). (Source: SRC)

Addressing the reliability of mainstream systems is a focus of the collaboration, according to Todd Austin, associate professor of electrical engineering at the U. of Michigan and a former Intel design engineer. “In the industry, there has yet to be support for diagnosis and repair in mainstream systems,” Austin told SST. “High-end systems that require high availability, such as phone systems and space systems, typically use triple-modular redundancy [TMR] if they can’t go down, or double-modular redundancy [DMR], if they can tolerate a small downtime for repair.” These techniques are too expensive for low-cost consumer goods and mainstream systems, he explained. “All that has been done historically to lessen this problem is aggressive burn-in, where systems are run hot and with high voltage for the first 24+ hours to ferret out any weak devices.” The researchers anticipate they will be able to dramatically advance defect tolerance for mainstream designs at a reasonable cost. Austin said the DMR and TMR methods can cost about 100%-200% more for hardware operation, but “5%-10% is our target as a tolerable cost for mainstream systems.”
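The trade-off Austin describes can be illustrated with a minimal sketch. The function names below (`tmr_vote`, `dmr_check`, `hardware_overhead`) are hypothetical and chosen only for illustration, and the overhead figure assumes the voter or comparator logic itself is free, which a real design would not be.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Triple-modular redundancy: return the value at least two replicas agree on.

    One faulty replica is outvoted, so the system keeps running.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def dmr_check(a, b):
    """Double-modular redundancy: a mismatch reveals a fault but not which
    replica is wrong, so the system must stop for repair."""
    if a != b:
        raise RuntimeError("replica mismatch: fault detected, repair needed")
    return a


def hardware_overhead(replicas):
    """Extra hardware, in percent, relative to one unreplicated unit
    (assumption: voter/comparator cost is ignored)."""
    return (replicas - 1) * 100


print(tmr_vote(7, 7, 9))        # the faulty third replica is outvoted
print(hardware_overhead(2))     # DMR
print(hardware_overhead(3))     # TMR
```

Under these assumptions, duplicating or triplicating a unit adds 100% or 200% more hardware, which is consistent with the range Austin cites and far above the 5%-10% target.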

One technique that the researchers will use is photographing chips during the evolution of their failures and translating those results into design improvements. Another technique the team will incorporate is “continuous functional verification,” whereby the hardware is continuously tested. “When the hardware is found to be broken by the continuous testing, we simply undo potentially bad computation, repair the hardware by disabling broken components, and restart in degraded mode, that is, by doing your work without using that component,” noted Austin. “These approaches will lead to additional alternatives for testing and developing self-healing chips,” he said, adding that this “continuous functional verification” method is more complicated but very inexpensive, with overhead on the order of 5%-10%.
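The check, undo, disable, and restart cycle described above can be sketched in software. This is a toy model under stated assumptions, not the researchers' design: the `Unit` class, its `faulty` flag, and `checked_sum` are hypothetical, and the checker (an inverse subtraction) is assumed to be simple enough to be trusted.

```python
class Unit:
    """Hypothetical execution unit; `faulty` simulates a worn-out component."""

    def __init__(self, name, faulty=False):
        self.name = name
        self.faulty = faulty
        self.enabled = True

    def add(self, x, y):
        # A faulty unit silently returns a wrong result.
        return x + y + (1 if self.faulty else 0)


def checked_sum(values, units):
    """Sum `values` on the enabled units, verifying every step."""
    state = 0
    turn = 0
    for v in values:
        while True:
            active = [u for u in units if u.enabled]
            if not active:
                raise RuntimeError("no working units left")
            unit = active[turn % len(active)]
            checkpoint = state            # saved so the step can be undone
            state = unit.add(state, v)    # speculative computation
            if state - v == checkpoint:   # continuous check (assumed trusted)
                turn += 1
                break
            state = checkpoint            # undo the potentially bad result
            unit.enabled = False          # "repair" by disabling the unit
            # loop again: retry the same step in degraded mode
    return state


units = [Unit("alu0"), Unit("alu1", faulty=True)]
print(checked_sum([1, 2, 3, 4], units))
print([u.name for u in units if u.enabled])
```

In this run the faulty unit is caught on its first use, its result is discarded, and the remaining work completes on the surviving unit alone, at reduced throughput but with a correct answer.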

Assessing the need for self-healing chips for advanced technology devices not yet in production, Austin observed that experts are very worried about in-field transistor wear-out, with some expecting a significant portion of every chip’s transistors to work improperly below 45nm. “Some estimates predict the length of lifetime per part dropping by over an order of magnitude as we move to sub-100nm technology,” Austin noted. “That’s not only a bad scenario for the world’s populations that have limited serviceability for their electronic systems; it’s also bad for every individual, as the devices they count on could potentially have shorter lifetimes, unless the user can afford much higher-cost components.” -D.V.