A scientist recently pointed me out to his colleagues. “That is not Carl Zimmer,’’ he declared.
The scientist was Mark Gerstein. He was sitting at a table in his office at Yale University, flanked by two members of his lab. “Really,’’ Gerstein said, pointing to a slim hard drive on the table, “this is Carl Zimmer.’’
By “this,’’ he meant the sequence of my genome, which was being transferred from the drive onto a MacBook.
“I’m quite serious,’’ Gerstein said. “In about five minutes, he will be in this computer.’’
I had come to Yale to give Gerstein and his colleagues my genome to explore. I wanted them to help me find out what was in there.
I was doing something far different than getting a conventional genetic test from a doctor or sending my spit to a genealogy company. Those tests typically only determine snippets of a person’s DNA, providing the sequence of less than 1 percent of the genome. Instead, I had gotten my entire genome sequenced and had then managed to get hold of all the raw data — the information that scientists use to understand how people’s genes help make them who they are.
If you could have read the data flowing into Gerstein’s MacBook, you would have seen a spreadsheet from hell. Each row contained a string of A’s, C’s, G’s, and T’s in various combinations, running a couple hundred letters long, accompanied by a few cells containing short numbers and codes. All told, there were 1.2 billion rows.
Watching my genome flow into Gerstein’s computer made me a little giddy. I began writing about DNA sequencing in the 1990s, at a time when sequencing the human genome — any human genome — seemed about as easy as a manned mission to Mars.
It took hundreds of scientists — and about $3 billion — to assemble the first human genome sequence in 2001. Since then, the cost of DNA sequencing has crashed, while the accuracy has skyrocketed. Scientists have now sequenced the genomes of an estimated 150,000 people.
Nevertheless, very few of the people who have their genomes sequenced get their hands on their own genomes. And those few people typically only get a highly filtered report. To get the raw data is almost unheard of. I am, to my knowledge, the first journalist to do so.
Over the past several months, I enlisted Gerstein and two dozen other scientists to help me see what’s lurking in my own genome. They have volunteered their time and expertise, acting like scuba diving guides, leading me through undersea canyons.
The experience has revealed to me quite a lot about myself — but also, more importantly, about human genomes in general, and the advances scientists are making in understanding them.
Perhaps just as importantly, I’ve learned just how hard it remains for experts to make sense of anyone’s genome.
A code is broken
I had initially had my genome sequenced by Illumina, the leading manufacturer of DNA-sequencing machines.
The process was straightforward: An Illumina team took a sample of my blood, cracked open my blood cells, and then extracted their DNA.
But DNA can’t just be read from one end to the other. A human genome is so big that it would take too long. The DNA might also snap apart into pieces during the process.
Instead, Illumina team members do something counterintuitive: They smash the DNA into lots of fragments, make lots of copies of those fragments, read them all, and then try to put the sequences back together.
To do so, they take advantage of DNA’s own capacity to make copies of itself.
Each DNA molecule is actually a pair of strands assembled from building blocks known as bases. The bases are like the alphabet in which our genes are written. Instead of our 26-letter alphabet, there are only four different kinds of bases in DNA: A for adenine, T for thymine, C for cytosine, and G for guanine.
When our cells divide, they build a new copy of their DNA. They do so by splitting the old molecule into its two strands, and then building a new strand for each one. This process is remarkably simple. Each base can only pair with one other base: A with T, C with G. To read my DNA, Illumina mimicked this chemistry.
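The pairing rule described above can be written out in a few lines of Python. This is only a sketch of the rule itself (A with T, C with G), not of Illumina's chemistry or software; the function names are made up for illustration.

```python
# Each base pairs with exactly one partner: A with T, C with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand: str) -> str:
    """Return the partner strand, base by base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand as it is conventionally written.

    The two strands of DNA run in opposite directions, so the partner
    is usually reported reversed.
    """
    return complementary_strand(strand)[::-1]


if __name__ == "__main__":
    print(complementary_strand("ATCG"))
    print(reverse_complement("ATCG"))
```

Because every base has exactly one partner, knowing one strand is enough to rebuild the other, which is the property that copying and sequencing both rely on.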
A few weeks after having my blood drawn and shipped off to Illumina, I got a call from a genetic counselor. She had my results back from Illumina.
Illumina found that I might not respond well to certain medications, but didn’t find any firm evidence that I suffered from a genetic disorder. I was also a carrier for two diseases, but they were based on mutations that were nothing for my family to worry about.
I felt relieved. But after the relief passed, the whole experience made my genome seem very boring. That seemed wrong.
Luckily, before I had my genome sequenced, I had an inkling that I would be let down. A few years ago, I was having lunch with some geneticists.
“You know what you should do someday? You should get your genome sequenced,’’ one declared. “But then you know what you should do? You should get your BAM file. If you do that, you can bring it to scientists like us. Then you can really see what’s going on.’’
I didn’t have the courage at the time to admit that I didn’t know what a BAM file was. But now it was time to find out.
BAM reveals all
A BAM file, I learned, is all the raw data that come out of genome sequencing. It’s a tremendous chunk of information, weighing in at 70 gigabytes, the equivalent of more than 400 feature-length movies. As big as it may be, nothing less will do for scientists who want to explore a genome in its full complexity.
Yet getting your own BAM file can be surprisingly hard.
Illumina, for example, states that it will provide BAM files “solely for use in clinical research.’’ It’s not surprising that companies like Illumina are wary of simply handing over BAM files to the public. Customers struggling to interpret a BAM file may misdiagnose themselves and run after treatments they don’t need.
But all this caution puts curious people in a difficult spot.
For help, I turned to Robert Green, a geneticist at Harvard Medical School. Green is running a study called PeopleSeq, designed to find out how healthy people respond to getting their genome sequenced.
Most people in the study are only getting information about their genome filtered through their doctor. Some are looking at a carefully curated website that Illumina created. But Green had expanded the study to let participants get their hands on a hard drive with their own BAM file.
“We have created the protocol to return the hard drives, but have actually never done it yet!’’ he e-mailed me. “You might be the first.’’
I joined the PeopleSeq study, and Green’s team then asked Illumina for the BAM file. They also had me sign a form stating that I understood that the data hadn’t undergone the quality checks that Illumina had used to generate my clinical report.
Months later, a UPS box arrived at my house. Inside was a tube of green bubble wrap, inside of which was a black fabric case shaped like a kidney bean. I unzipped the case, and inside I found a hard drive with a brushed-metal gleam. The process might be clunky, but it had worked.
I was ready to enlist scientists to look at my BAM file.
Rosetta Stones
DNA sequencing is so familiar to us now, in the news and on TV crime shows, that it’s easy to get the impression that reading a genome is as simple as pulling a book off a shelf and thumbing through its pages. It is not.
“Biology is complex,’’ Konrad Karczewski, a young bioinformatics expert at the Broad Institute and the first scientist to examine my BAM file, told me. “We already knew that, but I don’t think I really appreciated how ridiculous the problem is until I started doing this.’’
When Illumina sequenced my genome, what it actually did was read the sequence of 1.2 billion fragments of my DNA. At this stage, these fragments (known as reads) are like loose jigsaw puzzle pieces waiting to be put together. Some of those fragments also contain mistakes due to bad chemical reactions.
Scientists like Karczewski have a tool for cracking this puzzle: the human reference genome. It’s a highly accurate sequence of a single person’s genetic material. Since all people have relatively similar DNA, Karczewski could use the human reference genome to pinpoint the location of many of my own reads.
But because the reads are so short and the human genome is so long, Karczewski didn’t want to simply run a brute-force search for matches. That would have taken centuries to complete. Instead, using some shortcuts, Karczewski needed only 30 hours to figure out where most of the reads belonged.
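The article does not say which shortcuts Karczewski used. One common idea, sketched below in Python as an assumption rather than a description of his pipeline, is to index the reference once so that each read can be looked up instead of being compared against every position. The function names, the seed length `k`, and the mismatch limit are all hypothetical choices for this toy example.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-letter stretch of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def locate_read(read: str, reference: str, index: dict, k: int,
                max_mismatches: int = 1) -> list:
    """Find where a read fits, checking only positions suggested by its first k letters."""
    hits = []
    seed = read[:k]
    for pos in index.get(seed, []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    reference = "GATTACAGATTTCCAGGT"
    index = build_index(reference, k=4)
    print(locate_read("CAGATTTC", reference, index, k=4))
```

Building the index costs one pass over the reference; after that, each read triggers a handful of lookups rather than a scan of the whole genome. Production aligners use more compact indexes and handle insertions and deletions, which this sketch ignores.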
Next, Karczewski stripped errors out of the reads. In one step in this cleansing, he took advantage of the fact that Illumina produces so many reads that they overlap on my genome many times over.
That redundancy allowed Karczewski to spot the errors in DNA sequencing. If one read has an A at one spot, while the other 30 have a T at the same spot, you can safely conclude that my genome has a T there.
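The voting logic in that example can be sketched in Python. This is a bare majority vote with a hypothetical function name; it is not the error-correction software used at the Broad Institute, and it ignores real complications such as positions where a person carries two different bases, one from each parent.

```python
from collections import Counter


def consensus_base(observed_bases: str) -> tuple:
    """Return the most common base at one position and the share of reads supporting it."""
    counts = Counter(observed_bases)
    base, votes = counts.most_common(1)[0]
    return base, votes / len(observed_bases)


if __name__ == "__main__":
    # One read says A, the other 30 say T, as in the example above.
    base, support = consensus_base("A" + "T" * 30)
    print(base, round(support, 3))
```

With 30 of 31 reads agreeing, the lone A is far more likely to be a sequencing error than a true feature of the genome.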
After two weeks of this scrubbing and fixing, Karczewski allowed the Broad Institute’s servers to write out my genome sequence.
When I went back to Cambridge and sat down with Karczewski, he opened a special kind of browser. Think of it as Google Chrome for DNA — one that showed me his reconstruction of my own genome.
To demonstrate to me how good it was, Karczewski navigated to a gene called HTT.
HTT is not just any gene. Certain mutations in HTT cause Huntington’s disease, a devastating disease that starts in middle age, leads to dementia, and ends with death. Unfortunately, these mutations can also be hard to recognize.
The problem with HTT is that the mutations strike a region of the gene that is made up of the bases C, A, and G, repeated over and over. Healthy people have a wide range of CAG repeats. It’s only when people get 37 or more in HTT that they are at risk of developing Huntington’s.
Repeating DNA is very hard to sequence accurately with short reads, because there aren’t any distinctive sequences to anchor them.
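Once a stretch of sequence has been reconstructed, counting the repeats is simple in principle. The Python sketch below finds the longest unbroken run of CAG in a string and compares it with the threshold of 37 quoted above. The function names and the sample sequence are invented for illustration, and this is in no way a substitute for a clinical test.

```python
import re


def longest_cag_run(sequence: str) -> int:
    """Return the number of repeats in the longest unbroken run of CAG."""
    runs = re.findall(r"(?:CAG)+", sequence)
    return max((len(run) // 3 for run in runs), default=0)


def meets_threshold(repeat_count: int, threshold: int = 37) -> bool:
    """Compare a repeat count with the risk threshold given in the article."""
    return repeat_count >= threshold


if __name__ == "__main__":
    # A made-up stretch of sequence with 17 repeats, then an interruption.
    toy_sequence = "GCC" + "CAG" * 17 + "CAACAGCCG"
    count = longest_cag_run(toy_sequence)
    print(count, meets_threshold(count))
```

The hard part is not the counting but getting a trustworthy sequence to count in, since short reads that consist entirely of repeats give no clue about where in the run they belong.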
When Illumina sequenced my genome, it had not been able to reconstruct my HTT gene completely. Rather than make a bad guess, it simply left parts of it blank.
But now, thanks to Karczewski, I was looking at a complete sequence of my HTT gene. If I wanted to, I could just lean forward and count my CAG repeats.
“You know . . . I should have probably started with a crap-ton of genetic counseling before we did this,’’ Karczewski said.
Such are the risks you face when you take your genome into your own hands.
Nonetheless, I quickly did some bioethical calculations. It takes just one defective copy of HTT to cause Huntington’s disease. But none of my ancestors whom I knew of suffered from the disorder, making it unlikely they could pass it down to me.
I was pretty confident that I would have a normal HTT gene. But in the unlikely event that I did have Huntington’s disease, I decided, I’d rather know now than wait to be horribly surprised.
“Let’s look,’’ I said.
And so we did.
The reference genome has 19 CAG repeats. We counted only 17 in mine.
If Karczewski’s reconstruction was accurate, then I don’t have to worry about developing Huntington’s disease.
Whole-genome sequencing is not yet accurate enough to serve as a reliable medical test for a particular disease like Huntington’s. People who have relatives with Huntington’s and want to see if they carry the mutation should get precise tests that determine only the sequence of HTT, ignoring the rest of the genome.
But just because whole-genome sequencing isn’t 100 percent accurate doesn’t mean it’s not valuable. And, thanks to the careful assembly by Karczewski and other scientists, I at last had a reconstruction of my genome that I could explore.
About this series
STAT national correspondent Carl Zimmer takes a narrative journey through the human genome — his own. The first journalist known to have acquired the raw data of his own genome, Carl spent months interviewing leading scientists about the latest in genome research to learn more about himself and about human genomes in general.
This is an abridged version of the first part of a three-part series. To read more, go to www.statnews.com/gameofgenomes.
Carl Zimmer can be reached at carl.zimmer@statnews.com. Follow Carl on Twitter @carlzimmer.