Archive | June, 2015

No programming background? No problem! Learn R

14 Jun

Guest post by Rosana Callejas


Can someone with no programming knowledge learn “R”? The answer is yes! My name is Rosana Callejas. I am a Physiology major and a recent graduate of San Francisco State University. I began to learn the programming language “R” at the beginning of February of this year. Despite not having any previous programming experience, I analyzed my first data set of more than 20,000 data points in only a couple of months. Would you like to learn how I did it? Stay tuned.

The power of “R”

So what exactly is “R”? It is a programming language used by many data analysts, scientists, and statisticians to analyze data and to perform statistical analysis with graphs and figures. “R” is a great tool for analyzing large data sets. It has many additional packages that can be downloaded, which allow the user to expand or simplify commands when analyzing data.

How R coded its way into my heart

Dr. Pleuni Pennings, an evolutionary biologist and Professor at SFSU, introduced me to this wonderful tool. “I do all my research on my computer,” Dr. Pennings said, as she showed me the open program. At first, the idea puzzled me. In all my years as a biology student, I had never met a biologist like Dr. Pennings, who has made many discoveries from analyzing HIV DNA sequences using R. She explained to me that scientists collect data every day, and that much of it accumulates while waiting to be analyzed. Therefore, there is a need for scientists with the skills to interpret and draw conclusions from such large data sets. This interested me as a biologist. I imagined all the new findings that could be made if all the data collected was analyzed. It would definitely contribute to the advancement of science. With this in mind, I embarked on the adventure of learning R.

One command at a time

I began by taking the online course “Exploratory Data Analysis with R” on Udacity.com. The course is composed of 6 lessons, in which I first learned the basics of R and a few basic commands, followed by the analysis of one variable and how to make simple plots. In my learning, I used R and RStudio, both of which can be downloaded for free online. I analyzed data sets provided by Udacity, and I also practiced with the data sets that come with R. My first graphing assignment was a simple bar plot (Figure 1) that represented friend count for Facebook users of different ages. This task required the package “ggplot2”, which provides functions for making graphs.


Figure 1. Friend count as a function of age.
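A minimal sketch of a bar plot like the one described above, assuming ggplot2 is installed. The course data set is not reproduced here, so the data frame is simulated and its column names (`age`, `friend_count`) are hypothetical. The post does not say whether the bars show totals or averages; this sketch uses the mean friend count per age.

```r
# Assumes ggplot2 is installed (run once: install.packages("ggplot2")).
library(ggplot2)

# Simulated stand-in for the course data; column names are hypothetical.
set.seed(1)
users <- data.frame(
  age = sample(13:90, 500, replace = TRUE),
  friend_count = rpois(500, lambda = 150)
)

# Mean friend count for each age, computed with base R.
by_age <- aggregate(friend_count ~ age, data = users, FUN = mean)

# One bar per age; geom_col() draws bars whose heights are the supplied values.
p <- ggplot(by_age, aes(x = age, y = friend_count)) +
  geom_col() +
  labs(x = "Age (years)", y = "Mean friend count")
print(p)
```

Swapping `FUN = mean` for `FUN = median` or `FUN = sum` changes what each bar represents without touching the plotting code.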

As I learned more, I began to work with different packages, use new commands, and make better graphs. I discovered how to add color to the graphs. I learned how to order variables, make subsets, group variables, add new columns to my data sets, work with multiple variables, run correlation tests, and much more. The figures below followed that first one and show the progress of my learning as I added more detail to the plot throughout the course.


Figure 2. Median friend count as a function of age by gender.


Figure 3. Friend count as a function of age. In the green graph, each point represents 20 data points in the data set. The black line represents the mean friend count. The blue line represents the 50th percentile (the median). The dotted lines represent the 90th and 10th percentiles.

 

Figure 4. The top graph represents friend count as a function of age in months, with the blue line representing the mean. The middle graph represents friend count as a function of age, with the blue line representing the mean. The bottom graph represents friend count vs. age in months rounded, multiplied, and divided by 5.
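The data-handling steps mentioned before these figures (ordering, subsetting, grouping, adding columns, and running a correlation test) can be sketched in base R. This is only an illustration on simulated data: the column names and the age cut-offs are assumptions, not taken from the course material.

```r
# Simulated data; column names are hypothetical.
set.seed(42)
users <- data.frame(
  age = sample(13:90, 200, replace = TRUE),
  gender = sample(c("female", "male"), 200, replace = TRUE),
  friend_count = rpois(200, lambda = 150),
  stringsAsFactors = FALSE
)

# Order the rows by age.
users <- users[order(users$age), ]

# Subset: keep only adult users.
adults <- subset(users, age >= 18)

# Add a new column: a coarse age group for each user.
adults$age_group <- cut(adults$age, breaks = c(17, 30, 50, 90))

# Group: median friend count by age and gender (the quantity shown in Figure 2).
medians <- aggregate(friend_count ~ age + gender, data = adults, FUN = median)

# Correlation test between age and friend count.
result <- cor.test(adults$age, adults$friend_count)
print(result$estimate)
```

Because the simulated friend counts do not depend on age, the estimated correlation here should be close to zero; real data would of course behave differently.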

Patience is the mother of all virtues

Learning R was definitely a challenge. Commands that in theory should work sometimes did not. As a new user, I found it difficult to know exactly what had gone wrong. Fortunately, I had the guidance of Dr. Pennings, who helped me through the process. I also looked for resources outside of Udacity. One great package to use along with R is “swirl,” which is a teaching package. With swirl, I learned commands not taught in the Udacity course. It has multiple lessons that give the user immediate feedback. Patience and persistence are key to learning R. Now that I have seen what R can do, I know it was worth learning.
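For readers who want to try swirl, a minimal sketch of starting it from the R console follows. It assumes the package has already been installed from CRAN, and it only launches the lessons in an interactive session, since swirl works by prompting the user for input.

```r
# Run once to install the package: install.packages("swirl")

# Check whether swirl is available without attaching it.
swirl_available <- requireNamespace("swirl", quietly = TRUE)

# swirl is interactive, so only start it from a live R console.
if (swirl_available && interactive()) {
  library(swirl)
  swirl()  # opens the lesson menu and prompts for a name
}
```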

The possibilities are endless

My favorite feature of R is that the code used in a previous analysis can be saved and reused. R users can also share pieces of code with one another, which helps spread knowledge among users. If changes need to be made in the middle of an analysis, making them is rather simple, and there is no need to start the analysis over. R can be used to study many different types of data of any size or background. Scientists such as Dr. Pennings make major findings in Biology using R.

Although new to R, I was able to begin the analysis of my own data set [1] within only a few months of learning about it. Below is a figure which resulted from the question: Which HIV regimens are most common and in what years? In order to answer this question, many hours of work were invested in preparing the data set, excluding undesired data points, subsetting, color coding, etc., ending up with 6255 HIV data points, which included only the 26 most common unique regimens as a function of time. The graph represents the most common regimens of HIV treatments taken by patients in different years. It is also organized in order of increasing number of drugs per regimen. Each regimen was color coded according to whether it includes an NNRTI drug, includes a PI drug, or consists of nRTIs.

Figure 5. The most common regimens of HIV treatments taken by patients in different years, color coded as NNRTI, nRTI, or PI regimens.

As the graph shows, in 1989 and the early 1990s, HIV treatment consisted of the single drug AZT and later, in 1997, NVP. As the years progressed, regimens composed of two drugs became more common. It is not until 1996 that we begin to see regimens composed of three drugs. Regimens composed of three drugs are the most abundant and continue to be taken by patients up to 2013, while the single-drug treatments seem to have ceased in 2008. In 2002, we first observe regimens composed of four drugs (although RTV is often not counted as a drug, so these regimens may be considered 3-drug regimens as well), which also continue to be used along with the three-drug regimens.
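The preparation behind a figure like this (counting regimens, keeping the most common ones, and tabulating them by year) can be sketched in base R. This is not the code used for the analysis: the data frame is simulated, the column names are hypothetical, and the drug combinations are made up for illustration. The post keeps the 26 most common regimens; the toy data is small, so the sketch keeps 5.

```r
# Simulated stand-in for the real data; column names and regimens are hypothetical.
set.seed(7)
treatments <- data.frame(
  year = sample(1989:2013, 300, replace = TRUE),
  regimen = sample(
    c("AZT", "NVP", "AZT+3TC", "AZT+3TC+EFV", "TDF+FTC+EFV",
      "AZT+3TC+LPV+RTV", "D4T"),
    300, replace = TRUE,
    prob = c(0.2, 0.05, 0.2, 0.2, 0.2, 0.1, 0.05)
  ),
  stringsAsFactors = FALSE
)

# Count how often each regimen occurs and keep only the most common ones.
n_keep <- 5  # the post uses 26
counts <- sort(table(treatments$regimen), decreasing = TRUE)
top_regimens <- names(counts)[seq_len(n_keep)]
common <- subset(treatments, regimen %in% top_regimens)

# Number of drugs per regimen, which can be used to order regimens in a plot.
common$n_drugs <- lengths(strsplit(as.character(common$regimen), "+", fixed = TRUE))

# Regimen-by-year table: number of records of each regimen in each year.
by_year <- table(common$regimen, common$year)
print(by_year)
```

A table like `by_year` shows at a glance in which years each regimen appears, which is the information the figure conveys visually.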

R is a great program for data analysis. I believe that anyone who would like to learn it can definitely do so with persistence. I will continue learning R and analyzing my data set. I hope to use it as a tool for future investigations in my career.

[1] Thanks to Dr. Robert Shafer from Stanford University for sharing the data with us!