by John Timmer, arstechnica.com
If you’re of European descent, there’s a good chance that you can be found.
Earlier this year, news broke that police had devised an unexpected new method to crack cold cases. Rather than use a suspect’s DNA to identify them, data from the DNA was used to search public repositories and identify an alleged killer’s family members. From there, a bit of family tree building led to a limited number of suspects and the eventual identification of the person who was charged with the Golden State killings. In the months that followed, more than a dozen other cases were reported to have been solved in the same manner.
The potential for this sort of analysis had been identified by biologists as early as 2014, but they viewed it as a privacy risk, since personal information from research subjects could leak out to the public via their DNA sequences. Now, a US-Israeli team of researchers has quantified the chances of someone being identified through public genealogy data. If you live in the US and are of European descent, odds are 60 percent that you can be identified via information that your relatives have made public.
ID, the family plan
Any two humans share identical versions of the vast majority of their DNA. But there are enough differences commonly scattered across the three billion or so bases of our genomes that it’s now cheap and easy to determine which version of up to 700,000 differences people have. This screen forms the basis of personal DNA testing and genealogy services.
These differences make it easy to identify an individual’s DNA, even if they’re in a large database. While any two individuals may share the same variant at one location, everyone but identical twins will have enough differences to be distinguished. And, as you might imagine, close family members share more similarities than any two random strangers.
But as you move out along the branches of the family tree to more distant relatives like third cousins, the number of differences continues to grow. At this point, a different sort of analysis works better. Variations that are on the same chromosome are physically linked because they reside on the same DNA molecule, so they tend to be inherited together. Over time, exchanges between chromosomes will break up this run of linked variations, but this happens slowly. As a result, distant cousins may not have a huge number of shared variations, but the shared ones will all tend to cluster together as a run of identical variations on a small stretch of a chromosome. The size of those identical stretches will tend to go down over the generations.
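As a rough illustration of the idea of a shared run, here is a minimal Python sketch that finds the longest stretch of consecutive matching variants between two people. It is a simplification under stated assumptions, not the method any genealogy service actually uses: each person is reduced to a single string of variant calls along one chromosome, the function name longest_shared_run and the example sequences are made up for this sketch, and real tools work with paired chromosomes, tolerate testing errors, and measure segment length in genetic distance rather than a count of positions.

```python
def longest_shared_run(a, b):
    """Return (start, length) of the longest run of consecutive
    positions at which the two variant sequences are identical.

    Assumption: `a` and `b` are simplified, same-order lists or strings
    of variant calls along one chromosome (hypothetical toy data).
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            if run_len == 0:
                run_start = i          # a new matching run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0                # a mismatch breaks the run
    return best_start, best_len


# Toy example: the two sequences differ at positions 1 and 10.
person_a = "ACGTTGCAAGCT"
person_b = "ATGTTGCAAGAT"
start, length = longest_shared_run(person_a, person_b)
print(f"longest shared run starts at {start} and spans {length} variants")
```

In this picture, a longer run suggests a closer relationship, and a search for distant cousins amounts to keeping matches whose longest runs fall between a lower and an upper length threshold.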
Many of the DNA testing and genealogy services offer this analysis as a way to find lost family members who have used the same service. But they also allow you to download the data on your variations and then upload it to independent services, which may have a larger user base, and thus a better chance of picking out family members. It was one of these services that made the match that was key to the Golden State Killer case. Police did enough DNA testing to have a list of the killer’s variations, formatted the information appropriately, and used one of these services to identify a likely family member. From there, other genealogical information and public records could be used to build a family tree and identify likely suspects on it.
While most people would support the solving of crimes using this method, there are some basic privacy implications. Some might not be comfortable with having personal genetic information about themselves shared by their family members without permission. And, as noted above, this could be used to obtain personal health information if the participant has ever been involved in medical studies.
So, the researchers behind the new study decided to quantify the risks involved. They started with a database of 1.3 million people who had been tested by a consumer genealogy company. They then chose individuals at random from this pool and searched for distant family members. (They decided to go for distant family members because close family members often coordinate DNA tests and do them at the same time.) This involved searching for stretches of identical variants that were long enough to indicate relation, but not as long as what you’d typically see in first cousins.
In 15 percent of the cases, they were able to identify what appeared to be second cousins. Another 45 percent were third or fourth cousins for a total of a 60-percent success rate for identifying likely family members.
The researchers estimate that a database would have to have about two percent of a population in it in order to make it likely that 90 percent of the searches would produce a match to a family member. That is, of course, assuming that the database randomly samples the population. This one does not; instead, people of European descent accounted for 75 percent of the people in this database, making them 30 percent more likely to match to a family member. This is a rare case where a potential forensic tool is probably biased toward identifying wealthy white individuals.
Next up, the researchers decided to see if they could use this information to identify a specific individual, assuming they knew that individual’s approximate age and location. Using public family tree information, they estimated that any match at the DNA level could lead to about 850 family members. Limit that by sex (which DNA can tell you), and you cut the list in half. It goes down by over half if you limit it to those living within 100 miles of a given location (i.e., the site of a crime), and having the person’s age give or take five years will cut out more than 90 percent of the remaining individuals. The final result is about 16 to 17 people to screen through, something that most police forces should be capable of managing.
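The narrowing described above is simple multiplication, which a short Python sketch can make concrete. The fractions used here are assumptions chosen as round upper bounds from the wording ("in half," "over half," "more than 90 percent"), not figures from the study, and the function name narrow_candidates is hypothetical. Because two of the three filters cut more than the round numbers suggest, the result is a ceiling that sits a little above the 16 to 17 people reported.

```python
def narrow_candidates(start, filters):
    """Apply each (label, fraction_kept) filter in turn and return
    the running number of candidates after every step."""
    remaining = float(start)
    steps = []
    for label, fraction_kept in filters:
        remaining *= fraction_kept
        steps.append((label, remaining))
    return steps


# Assumed round-number fractions; the article gives only bounds.
filters = [
    ("same sex", 0.5),                # "cut the list in half"
    ("within 100 miles", 0.5),        # "down by over half" -> at most 0.5 kept
    ("age within five years", 0.1),   # "more than 90 percent" cut -> at most 0.1 kept
]

steps = narrow_candidates(850, filters)
for label, remaining in steps:
    print(f"after filtering by {label}: about {remaining:.0f} people")
```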
To see how this works in practice, they picked a woman whose genome had been sequenced as part of a government-funded project and who was known to live in Utah. An initial search pulled out two relatives, one each in Wyoming and North Dakota. Those two were distantly related to each other, and the researchers needed just an hour to identify a couple who were common ancestors of both. From there, they built a tree of the entire family descended from this couple. Although the researchers complained that this had to be done manually and was time-consuming, it was finished within a day. From that tree, they were able to pull out the woman whose DNA had been sequenced.
Overall, the work reinforces the message from the initial surge in genealogical identification of criminals: on the genetic level, you shouldn’t expect much privacy, and decisions about your privacy are being made by your family (probably without consulting you).
If, for whatever reason, you’d like to maintain your privacy, the researchers have a couple of suggestions you could support. One is simply to have the government redefine private information to reflect this new reality, so that the studies it funds no longer link any personal information with DNA sequences. They also suggest that companies that offer direct-to-consumer genetics standardize on a signed, encrypted file format for information on variations. That would prevent people from taking DNA information from other sources, like DNA sequence repositories, and using it to track down your family members.
While this would mean that law enforcement would need the cooperation of a company to do the sort of searches that have been in the news, it’s likely this wouldn’t create a significant barrier to investigations.
Science, 2018. DOI: 10.1126/science.aau4832