A 2,000-year-old math theorem, along with Sudoku, may soon help researchers untangle DNA at blazing speeds.
Hunting for a particular genetic mutation in hundreds of thousands of specimens can be an expensive and time-consuming process. In the past several years, faster multiplex DNA sequencing machines have sped up the acquisition of data, but researchers have still been hobbled by having to label each sample with a unique molecular identifier (or bar code) for analysis.
Scientists at Cold Spring Harbor Laboratory (CSHL) on Long Island, N.Y., are proposing a new take on a very old idea to analyze large sets of samples simultaneously. The team is applying the Chinese remainder theorem to pinpoint single samples in larger pools, which are arranged in rows and columns.
Invented about 2,000 years ago, the theorem is a method for mapping information using prime and co-prime numbers: a number can be recovered from its remainders after division by several numbers that share no common factor. In the case of DNA sequencing and Sudoku, the theorem is used to organize data points with coordinates in a box, but it can also be used to figure out all sorts of missing information in other domains, such as distant points sensed with high-speed radar, pieces of code, and the identity of that attractive person you saw at three out of seven parties on a cruise ship.
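As a rough illustration of the theorem itself (not code from the study), the sketch below recovers a number from its remainders. The function name `crt` and the example moduli 3, 5 and 7 are assumptions chosen for illustration, and the moduli are assumed to be pairwise co-prime.

```python
from math import prod  # Python 3.8+ (also required for pow(x, -1, m))

def crt(remainders, moduli):
    """Return the unique x in [0, prod(moduli)) with x % m == r for each pair.

    Assumes the moduli are pairwise co-prime (no shared factors).
    """
    total = prod(moduli)
    x = 0
    for r, m in zip(remainders, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse of partial modulo m
        x += r * partial * pow(partial, -1, m)
    return x % total

# Which number leaves remainder 2 when divided by 3, 3 by 5, and 2 by 7?
print(crt([2, 3, 2], [3, 5, 7]))  # 23
```

Three small remainders are enough to single out one number among 3 × 5 × 7 = 105 candidates, which is the property that makes the theorem useful for locating one item in a large collection.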
By using the idea, researchers can deal with whole libraries of genetic information instead of looking at just “one genetic sequence at a time,” says Yaniv Erlich, the lead author of the paper, published as the cover story of this month’s Genome Research.
In Sudoku, players must fill every row and column with all nine numerals. With so many genetic samples to search, the researchers instead call on state-of-the-art robots, machines and programs to do the specimen placing and searching for them. “Every cell in a Sudoku [puzzle] is like a specimen, and every digit is like a genotype,” says Erlich, a doctoral student who had used the Chinese remainder theorem in previous work with radar. He brought the idea to the attention of his CSHL professor, Greg Hannon. The process allows researchers to pool dozens of samples and label the pool, rather than each individual sample, with a bar-code identifier. After the sequencing machine returns results from a whole pool, a decoder program can use the theorem to work backward and locate a particular specimen. To find a mutation in a cystic fibrosis study, for example, the decoding program would use each pool’s results as the constraints to pinpoint the location of the mutated specimen.
“Think about Sudoku as a pooling theory,” he says. “You have a constraint in a row and column [to] have all nine digits. We have the same thing—maybe not as neat—but we have all the sequences in the same pool.” From there, he explains, a program can go back and use the same logic to find the mutant DNA.
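The pooling-and-decoding idea can be sketched as a toy program. This is only an illustration under simplifying assumptions, not the study's decoder: it assumes exactly one specimen carries the mutation, error-free sequencing, and a hypothetical layout in which specimen number n is placed in pool n mod m for each of three co-prime pool counts. The names `MODULI`, `pools_for` and `decode` are invented for this example.

```python
from math import prod  # Python 3.8+ (also required for pow(x, -1, m))

# Hypothetical pairwise co-prime pool counts; they can address 7 * 8 * 9 = 504 specimens.
MODULI = (7, 8, 9)

def pools_for(specimen):
    """Pools a specimen is placed in: one (pool count, remainder) pair per modulus."""
    return [(m, specimen % m) for m in MODULI]

def decode(positive_pools):
    """Recover the single carrier from the pools in which the mutation appeared.

    Assumes exactly one carrier and one positive pool per modulus.
    """
    total = prod(MODULI)
    x = 0
    for m, r in positive_pools:
        partial = total // m
        x += r * partial * pow(partial, -1, m)  # Chinese remainder reconstruction
    return x % total

carrier = 321                      # the specimen we pretend is mutated
positive = pools_for(carrier)      # what an ideal sequencing run would report
print(positive)                    # [(7, 6), (8, 1), (9, 6)]
print(decode(positive))            # 321
```

In this toy layout, 7 + 8 + 9 = 24 bar-coded pools stand in for up to 504 individually labeled specimens. With several carriers or noisy reads, the positive pools no longer point to a single answer, and a more careful decoder would be needed.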
In the future, sequencing and analysis that would have taken months and $10 million could require just a few days of machine time and $50,000 to $80,000, the study authors note. All thanks to ancient Chinese number logic and a popular pen-and-paper puzzle game—which Erlich now plays regularly.