Archive | April, 2015

Using deep sequencing data to estimate selection coefficients in HIV

28 Apr

Messer, P. W., & Neher, R. (2011). Estimating the strength of selective sweeps from haplotype diversity data. Genetics.

I recently reread this paper by my colleagues Philipp Messer (used to be my office mate at Stanford) and Richard Neher (who works on the population genetics of HIV, just like I do). I thought it’d be worth writing a short blog post about this paper because it has some really nice ideas but it is quite technical and you may not have read it.

Selective sweeps in HIV

Selective sweeps happen in HIV when the virus fixes immune escape mutations or drug resistance mutations. Often, we don’t have good enough time series data to determine the frequency path of the beneficial mutation (i.e., how fast the beneficial mutation increases in frequency in the viral population). Without the frequency path, it is hard to quantify the selection coefficient of the beneficial mutation: how much fitter is the mutant virus than the virus it replaces?

The authors of the paper present a new method to estimate the selection coefficient of a beneficial mutation. The method requires deep sequencing data from a population in which a beneficial mutation has recently gone to fixation. The authors apply it to HIV sequences from patients in whom a drug resistance mutation or an immune escape mutation has just gone to fixation. It seems to me that the method may be especially useful for drug resistance mutations, because they may go to fixation rapidly and at unpredictable times, which makes it hard to follow their frequency path. The proposed method only requires a sample taken after fixation has happened.

The idea

The method is based on the following idea: If the selection coefficient of a beneficial mutation is very high, then the selected allele will quickly reach a high frequency without accumulating many new mutations. But if the selection coefficient is not so high, then it will take more time for the selected allele to reach a high frequency, and during this time it will accumulate new mutations.
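To get a feel for how much longer a weak sweep takes, here is a minimal sketch. It is not taken from the paper: it assumes the textbook deterministic (logistic) sweep in continuous time, and the population size of 10,000 is a made-up number for illustration. Under that model, going from a single copy (frequency 1/N) to near fixation (frequency 1 - 1/N) takes about 2 ln(N - 1) / s generations.

```python
import math


def sweep_duration(s: float, pop_size: int) -> float:
    """Generations for a deterministic logistic sweep to go from
    frequency 1/N to frequency 1 - 1/N (assumed model, not the paper's)."""
    if s <= 0:
        raise ValueError("selection coefficient must be positive")
    if pop_size < 3:
        raise ValueError("population size must be at least 3")
    # logit(1 - 1/N) - logit(1/N) = 2 * ln(N - 1), and logit grows at rate s
    return 2.0 * math.log(pop_size - 1) / s


if __name__ == "__main__":
    hypothetical_pop_size = 10_000  # assumption, for illustration only
    for s in (0.07, 0.5):
        generations = sweep_duration(s, hypothetical_pop_size)
        print(f"s = {s}: about {generations:.0f} generations")
```

The weaker the selection, the more generations the beneficial allele spends at every copy number along the way, and the more chances there are for new mutations to land on it.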

New neutral mutations that occur on the background of the beneficial mutation will be dragged to a higher frequency by the beneficial mutation. If a new mutation occurs on the background of the beneficial mutation very early, when there is only one copy of the beneficial mutation, then the frequency of the new mutation will always be the same as the frequency of the beneficial mutation, and the two will likely fix in the population together. If, however, the new mutation occurs when there are already 8 copies of the beneficial mutation, then it sits on 1 of those 8 copies and will likely reach approximately 12% frequency (like the red fraction of the population in the figure).

This figure shows how earlier mutations on the background of the beneficial mutation reach higher frequencies. (Fig 1 A in the paper)
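The arithmetic behind this is just "one copy out of however many there were". A small sketch, assuming the beneficial mutation goes on to fix and ignoring random drift among the copies (the function names are mine, not from the paper):

```python
def final_frequency(n_copies: int) -> float:
    """Approximate final frequency of a neutral mutation that arose on one
    of `n_copies` copies of the beneficial mutation (drift ignored)."""
    if n_copies < 1:
        raise ValueError("need at least one copy of the beneficial mutation")
    return 1.0 / n_copies


def copies_at_origin(final_freq: float) -> float:
    """Inverse: roughly how many copies of the beneficial mutation existed
    when a hitchhiking mutation with this final frequency arose."""
    if not 0.0 < final_freq <= 1.0:
        raise ValueError("frequency must be in (0, 1]")
    return 1.0 / final_freq


if __name__ == "__main__":
    for k in (1, 5, 8, 10, 100):
        print(f"{k:>3} copies -> final frequency {final_frequency(k):.1%}")
```

Read in the other direction, `copies_at_origin` turns an observed haplotype frequency into a rough guess of how far along the sweep was when that haplotype was born.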

In a fast sweep, the “5 copy moment” goes by quickly

For a new neutral mutation on the background of the beneficial mutation to ultimately reach a frequency of 20% in the population, it needs to occur when the beneficial mutation is present at approximately 5 copies. The new mutation then occurs on one of the 5 copies, and is thus present on 20% of the viruses with the beneficial mutation. If the beneficial mutation fixes, the new mutation will have a population frequency of around 20%. In a slow sweep, the beneficial mutation may spend several generations at around 5 copies, whereas in a fast sweep, the “5 copy moment” goes by quickly. Similarly, a mutation that happens when there are 10 copies may reach 10% frequency, and one that happens at 100 copies may reach 1%. If we have many sequences from the population (say, 1000), we can look at all the new mutations and their frequencies and determine how fast the sweep went, that is, what the frequency path of the beneficial mutation was. If we know the frequency path, we can estimate the selection coefficient of the beneficial mutation.
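Here is a back-of-the-envelope sketch of why slow sweeps leave more high-frequency hitchhikers. It is not the estimator from the paper. It assumes the beneficial allele grows deterministically by a factor (1 + s) per generation while rare, that each copy picks up a new neutral mutation with probability `u` per generation (a hypothetical rate for the sequenced region), and that a mutation arising at n copies ends at frequency 1/n.

```python
def expected_hitchhikers(s: float, u: float, min_freq: float) -> float:
    """Expected number of new neutral mutations that end up above `min_freq`.

    Assumptions (mine, for illustration): deterministic growth of the
    beneficial allele by (1 + s) per generation, mutation rate `u` per copy
    per generation, and final frequency 1/n for a mutation arising at n copies.
    """
    if s <= 0 or u < 0 or not 0.0 < min_freq <= 1.0:
        raise ValueError("need s > 0, u >= 0 and 0 < min_freq <= 1")
    max_copies = 1.0 / min_freq  # later mutations end below min_freq
    expected = 0.0
    n = 1.0  # the sweep starts from a single copy
    while n <= max_copies:
        expected += u * n   # expected new mutations in this generation
        n *= 1.0 + s        # the beneficial allele grows
    return expected


if __name__ == "__main__":
    u = 0.01  # hypothetical mutation rate for the sequenced region
    for s in (0.05, 0.5):
        count = expected_hitchhikers(s, u, min_freq=0.01)
        print(f"s = {s}: about {count:.1f} haplotypes expected above 1%")
```

Under these assumptions the expected count scales roughly as u / (s × min_freq), so a tenfold weaker sweep leaves close to tenfold more haplotypes above any given frequency. That is the kind of signal that lets the haplotype frequencies tell us about s.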

Richard and Philipp used their method on HIV data because these data are deep enough to do this.

This is a sweep of a drug resistance mutation. The inset shows the genetic distances between the most common haplotypes in the dataset. All haplotypes have just one new mutation, except haplotype 13, which has 2. The main figure shows the ranks of the haplotypes on the x-axis vs their abundance (relative to the haplotype that had no new mutations) on the y-axis. Haplotype 1 (with 1 new mutation) has a frequency of approximately 0.05, so its mutation must have occurred when there were around 20 copies of the beneficial mutation. The estimated selection coefficient is 0.07. This is figure 6 A in the paper.

Use the method to study new infections?

I wonder whether this method could be used to see how quickly a new HIV infection is growing in a person, if we had deep sequencing data from a newly infected person.