How to analyze 80,000 HIV sequences?

2 Nov

A few months ago, Bob Shafer asked me if I wanted to work with him and Susan Holmes on an editorial for the Journal of Infectious Diseases. Bob Shafer is a well-known Stanford virologist who runs the extremely useful Stanford HIV Drug Resistance Database. Susan Holmes is a statistics professor at Stanford. Our task was to write about a recent paper by Joel Wertheim and colleagues.

Start with 80,000 sequences

The paper (and also our editorial) deals with the following question: how can we analyze the large numbers of HIV sequences that are available in databases to learn about the global epidemic? The basic idea is that each sequence stems from one HIV-infected person, so by analyzing the genetic relationships between the sequences, we can learn how the virus spreads. For example, if person A infects person B who then infects person C, then all three of them are expected to have very similar HIV strains, with very similar sequences. If genetically similar sequences are usually found in the same country, then we could learn that the epidemic spreads more easily within countries than between them.

Build a tree

Traditionally, such analysis of HIV sequences starts with building a phylogenetic tree of the sequences. However, building trees is extremely hard if the sequences are recombining (as in HIV) and if there are a large number of sequences. One common solution to these problems is to remove recombinants and analyze a subset of the sequences. However, this means that we lose a lot of information.

Or not

Joel Wertheim and colleagues decided to take a very different approach. They started by calculating the pairwise genetic distance between all sequences. The reasoning is that viral sequences that are close to each other (with low pairwise distance) are likely to stem from people who are close to each other in the epidemic. Next, they created a network by connecting all pairs of sequences that were less than 1% different from each other. The resulting network could be analyzed by standard network analysis techniques. The authors were thus able to study the relationships between 80,000 worldwide HIV sequences. They found a surprising number of international connections between the sequences.
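To make the idea concrete, here is a minimal sketch of the distance-threshold approach in Python. It is not the authors' implementation. It assumes the sequences are already aligned, uses the simple proportion of differing sites as the distance (the paper may well use a different distance measure), and runs on made-up toy sequences. The names `p_distance`, `build_network` and `connected_components` are my own.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned, ungapped sites at which two sequences differ.
    Assumption: a simple p-distance stands in for the paper's distance measure."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing to compare, so treat the pair as unrelated
    return sum(x != y for x, y in pairs) / len(pairs)

def build_network(seqs, threshold=0.01):
    """Connect every pair of sequences that is less than `threshold` apart."""
    found = set()
    for (i, a), (j, b) in combinations(seqs.items(), 2):
        if p_distance(a, b) < threshold:
            found.add((i, j))
    return found

def connected_components(nodes, edges):
    """Group nodes into clusters by following edges (depth-first search)."""
    neighbours = {n: set() for n in nodes}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result

def mutate(seq, positions):
    """Toy helper: substitute the base at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)

# Made-up data: four aligned "sequences" of 200 sites each.
base = "ACGT" * 50
sequences = {"A": base}
sequences["B"] = mutate(base, [10])               # 0.5% from A
sequences["C"] = mutate(sequences["B"], [50])     # 0.5% from B, 1.0% from A
sequences["D"] = mutate(base, range(100, 110))    # 5% or more from the rest

edges = build_network(sequences)
components = connected_components(list(sequences), edges)
print(sorted(edges))
print(components)
```

In this toy example A and C are not directly connected (they differ at exactly 1% of sites, which is not below the threshold), but they still end up in the same cluster through B, while D stays on its own. Note that comparing all pairs does not come for free: 80,000 sequences means roughly 3.2 billion pairwise comparisons, so a real analysis would need something much faster than this loop.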

Limitations of the network approach

In our editorial, we argue that the network approach is worth exploring, but it has its own issues too. For example, we write that “many more connections may be inferred than could have possibly existed in the real transmission network (…) For example, if multiple infections happen in a short time span, several people may be infected with very similar viruses. The viral sequences from these people would all be connected by the method of Wertheim and colleagues, leading to many more edges in the thus constructed network than exist in reality.”
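A quick back-of-the-envelope calculation shows how fast the extra edges can pile up. The sketch below assumes a cluster of n people whose sequences are all within the threshold of each other, and assumes that everyone except the first person was infected by exactly one other person in the cluster. Both assumptions and the function names are mine, purely for illustration.

```python
def transmission_edges(n):
    """Edges in the real transmission network under the assumption above:
    each of the n people except the first has exactly one infector."""
    return n - 1

def threshold_edges(n):
    """Edges drawn by a distance threshold when all n sequences are
    within the cutoff of each other: every pair gets connected."""
    return n * (n - 1) // 2

for n in (2, 5, 10, 50):
    print(n, transmission_edges(n), threshold_edges(n))
```

For a cluster of 10 people, that is 9 real transmission events versus 45 edges in the inferred network, and the gap grows quadratically with the size of the cluster.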

It is not clear how these additional edges (connections) influence the results of the analysis. And as long as we don’t know that, we need to be very careful when interpreting results from network analyses based on sequences only. It’s clear what needs to be done!


Pleuni S. Pennings, Susan P. Holmes and Robert W. Shafer. HIV-1 Transmission Networks in a Small World. JID. 2013.

Joel O. Wertheim, Andrew J. Leigh Brown, N. Lance Hepler, Sanjay R. Mehta, Douglas D. Richman, Davey M. Smith and Sergei L. Kosakovsky Pond. The Global Transmission Network of HIV-1. JID. 2013.
