The ridiculous order of the streets in the Excelsior (SF)

26 Sep

I live in the Excelsior neighborhood in San Francisco. My street is Athens Street. If I walk westwards from my home, I come to Vienna Street and then Naples, Edinburgh and Madrid. If you have any knowledge of the map of Europe, you will realize that this order makes no sense!

(Also, why is there Naples, but not Rome, and why Munich, but not Berlin? And why oh why, is there no Amsterdam Street? So many questions!)

Last week, I asked the students in the CoDE lab to create a map to show the ridiculous order of the streets in the Excelsior. They had fun figuring out how to make a map in R, so I thought I would share their work here. Several students were involved, but my graduate student Olivia Pham did most of the work.

The code is here: http://rpubs.com/pleunipennings/212840


The surprising order of street names in the Excelsior neighborhood in San Francisco. We connected the cities in the order of the streets. London Street is the first city-name street if you enter the neighborhood from Mission Street. Just east of London Street is Paris Street, then Lisbon Street, etc. The last city-name street is Dublin Street, which is closest to McLaren Park.


A map of part of the Excelsior neighborhood showing the order of the city-name streets.
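
If you want to try something similar without opening the linked code, here is a minimal sketch in base R. It is not the students' code. The coordinates are approximate values typed in by hand, only the eight streets named in this post are included, and placing Madrid Street directly after Lisbon Street is my assumption.

```r
# Cities in the order their streets appear, walking east from Mission Street.
# Coordinates are approximate (degrees) and entered by hand for illustration.
cities <- data.frame(
  name = c("London", "Paris", "Lisbon", "Madrid",
           "Edinburgh", "Naples", "Vienna", "Athens"),
  lon  = c(-0.13, 2.35, -9.14, -3.70, -3.19, 14.27, 16.37, 23.73),
  lat  = c(51.51, 48.86, 38.72, 40.42, 55.95, 40.85, 48.21, 37.98)
)

# Draw the cities and connect them in street order.
plot(cities$lon, cities$lat, type = "b", pch = 19, col = "red",
     xlab = "Longitude", ylab = "Latitude",
     main = "Excelsior street order on a map of Europe")
text(cities$lon, cities$lat, labels = cities$name, pos = 3)

# Straight-line hops in degrees: a rough measure of how much the path zigzags.
hops <- sqrt(diff(cities$lon)^2 + diff(cities$lat)^2)
round(hops, 1)
```

The plot has no coastlines, so it only shows the zigzag. A real map background needs a mapping package, which is what the linked code uses.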

How to get started with R

1 Feb


I often get asked how to get started with learning R if there is not currently a class offered. Here is what I recommend:

1. Start with a free online Code School tutorial

First of all, check out this (free) online course: https://www.codeschool.com/courses/try-r
No need to install anything, no need to pay. Students in my bioinformatics class liked this online Code School course a lot. It will not make you a master of R, but it’s a nice starting point.

2. Install R, RStudio and swirl on your computer

Next, it is time to install R and RStudio on your computer. Once you have that, install the swirl package. Instructions for installing R, RStudio and swirl can be found here: http://swirlstats.com/students.html
swirl is an R package that helps you learn R while you are in the RStudio environment. I highly recommend using the RStudio environment! The swirl tutorials teach you the basics of vectors, matrices, logical expressions, base graphics, apply functions and many other topics. Kind words are included (“Almost! Try again. Or, type info() for more options.”).
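
To give a flavor of what this looks like, here is a small sketch. The commented lines are the usual way to install and start swirl. The rest is plain base R that touches a few of the topics listed above. The variable names are made up for this example.

```r
# To install and start swirl (run these once, interactively, in RStudio):
# install.packages("swirl")
# library(swirl)
# swirl()

# A few of the basics the swirl lessons cover, in plain base R:
ages <- c(23, 35, 41, 19)            # a numeric vector
ages > 30                            # a logical expression, one result per element
ages[ages > 30]                      # use it to pick out elements

m <- matrix(1:6, nrow = 2)           # a 2 x 3 matrix, filled column by column
row_sums <- apply(m, 1, sum)         # apply a function to each row
col_means <- sapply(1:ncol(m), function(j) mean(m[, j]))  # one mean per column
```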

3. Dive in with a great Udacity class …

If you are ready to really dive in (and have some time to invest), try out this great Udacity class: https://www.udacity.com/course/data-analysis-with-r--ud651 (no need to pay for it, you can do the free version). This class is taught by people from the Facebook data science team. They do a great job guiding you through a lot of R coding. Importantly, they always take the time to explain why you’d want to do something before they let you do it. A large part of the course is focused on using the ggplot2 package.

… or start reading The R Book

The R Book is a book by biologist and R hero Michael Crawley. The pdf of the book is available from many websites (for example: ftp://ftp.tuebingen.mpg.de/pub/kyb/bresciani/Crawley%20-%20The%20R%20Book.pdf). Make sure you also download the example data that come with the book (http://www.bio.ic.ac.uk/research/mjcraw/therbook/).

The R Book is a great resource and very clearly written. The students in my lab enjoy reading from it and trying out the code. If you are a biologist, it’ll be fun to work with the biology examples in The R Book.

4. Find others who are using R or learning R.

Learning R is hard. You will get frustrated sometimes. If you know someone who is learning with you or who could help you when you are stuck, things will be easier! If there is no one near you, try to find R-minded people on Twitter or elsewhere online. Also, check out the R forum on Stack Overflow (http://stackoverflow.com/questions/tagged/r) for many questions and answers on R.

Good luck!


End of summer poster session

19 Aug

Today was the last day that the summer students were in the lab (although some of them will be back next week when the semester starts). I asked each of them to make a poster with a figure they made this summer. They are learning to program in R, and making figures is a big part of what they’ve worked on. I took snapshots of some of the students with their posters. They did a great job!

Pedro Zorzanelli da Vitoria from Brazil

Brendan Kusuma (SFSU undergrad)

Julia Pyko (SFSU post-bac) and Patricia Kabeja (SFSU undergrad)

Dasha Fedorova (SFSU undergrad) made her poster together with Sidra Tufon (not in the picture).

Dwayne Evans (SFSU Master’s student)


No programming background? No problem! Learn R

14 Jun

Guest post by Rosana Callejas

Rosana Callejas

Can someone with no programming knowledge learn “R”? The answer is yes! My name is Rosana Callejas. I am a Physiology major and a recent graduate of San Francisco State University. I began to learn the programming language “R” at the beginning of February of this year. Despite not having any previous programming experience, I analyzed my first data set of more than 20,000 data points in only a couple of months. Would you like to learn how I did it? Stay tuned.

The power of “R”

So what exactly is “R”? It is a programming language used by many data analysts, scientists, and statisticians to analyze data and to perform statistical analyses with graphs and figures. “R” is a great tool for analyzing large data sets. It has many additional packages that can be downloaded, which allow the user to expand or simplify commands when analyzing data.

How R coded its way into my heart

Dr. Pleuni Pennings, an evolutionary biologist and Professor at SFSU, introduced me to this wonderful tool. “I do all my research on my computer,” Dr. Pennings said, as she showed me the open program. At first, the idea puzzled me. In all my years as a biology student, I had never met a biologist like Dr. Pennings, who has made many discoveries from analyzing HIV DNA sequences using R. She explained to me that there is an accumulation of data collected by scientists every day, waiting to be analyzed. Therefore, there is a need for scientists with the skills to interpret and draw conclusions from such large data sets. This interested me as a biologist. I imagined all the new findings that could be made if all the data collected were analyzed. It would definitely contribute to the advancement of science. With this in mind, I embarked on the adventure of learning R.

One command at a time

I began by taking the online course “Exploratory Data Analysis with R” on Udacity.com. The course is composed of 6 lessons, in which I first learned the basics of R and a few basic commands, followed by the analysis of one variable and how to make simple plots. In my learning, I used R and RStudio, both of which can be downloaded for free. I also used data sets provided by Udacity to analyze. In addition, R comes with other data sets that I practiced with. My first graphing assignment was a simple bar plot (Figure 1) that represented friend count for Facebook users of different ages. This task required the package “ggplot2”, which is used for making graphs.


Figure 1. Friend count as a function of age.
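
A plot along these lines can be sketched in a few lines of R. This is not the course code: the data are simulated because the Udacity file is not included here, the column names `age` and `friend_count` are my own choice, and the plot uses base graphics instead of ggplot2 so that it runs without installing anything.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for the course data: one row per user.
users <- data.frame(age = sample(13:70, 500, replace = TRUE))
users$friend_count <- rpois(500, lambda = 200)

# Mean friend count for each age, then a bar plot of those means.
mean_by_age <- tapply(users$friend_count, users$age, mean)
barplot(mean_by_age, xlab = "Age (years)", ylab = "Mean friend count",
        main = "Friend count by age (simulated data)")

# With ggplot2 installed, a comparable plot would be:
# library(ggplot2)
# ggplot(users, aes(x = age, y = friend_count)) +
#   stat_summary(fun = mean, geom = "bar")
```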

As I learned more, I began to work with different packages and new commands, and to make better graphs. I discovered how to add color to the graphs. I learned how to order variables, make subsets, group variables, add new columns to my data sets, work with multiple variables, run correlation tests, and much more. The following are some figures that followed that first one, and show the progress of my learning as I added more detail to that first plot throughout the course.
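
For readers new to R, the operations listed above might look something like this in base R. It is only a sketch on simulated data with made-up column names. The course itself works with its own data set and leans on additional packages.

```r
set.seed(2)

# Simulated data; the column names are made up for this example.
users <- data.frame(
  age    = sample(13:70, 300, replace = TRUE),
  gender = sample(c("female", "male"), 300, replace = TRUE)
)
users$friend_count <- rpois(300, lambda = 150 + users$age)

# Add a new column, take a subset, and order the rows.
users$age_group <- ifelse(users$age < 30, "under 30", "30 and over")
adults <- subset(users, age >= 18)
adults <- adults[order(adults$friend_count, decreasing = TRUE), ]

# Group by two variables and summarise each group with the median.
medians <- aggregate(friend_count ~ age_group + gender, data = adults, FUN = median)

# Test whether age and friend count are correlated.
ct <- cor.test(adults$age, adults$friend_count)
ct$estimate
```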


Figure 2. Median friend count as a function of age by gender.


Figure 3. Friend count as a function of age. In the green graph, each point represents 20 data points in the data set. The black line represents the mean friend count. The blue line represents the 50th percentile (the median). The dotted lines represent the 90th and 10th percentiles.
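
The summary lines in a figure like this come from computing a mean and a few percentiles for every age. Here is one way that could be done in base R, again on simulated data with assumed column names; the course does it with ggplot2.

```r
set.seed(3)

# Simulated friend counts; skewed, as counts like this often are.
users <- data.frame(age = sample(13:70, 2000, replace = TRUE))
users$friend_count <- rnbinom(2000, size = 1, mu = 200)

# For each age: the mean, plus the 10th, 50th and 90th percentiles.
summarise_age <- function(x) {
  c(mean = mean(x), quantile(x, probs = c(0.1, 0.5, 0.9)))
}
by_age <- t(sapply(split(users$friend_count, users$age), summarise_age))
head(by_age)

# Plot the summary lines against age: mean in black, percentiles in blue,
# with the 10th and 90th percentiles dotted.
ages <- as.numeric(rownames(by_age))
matplot(ages, by_age, type = "l", lty = c(1, 3, 1, 3),
        col = c("black", "blue", "blue", "blue"),
        xlab = "Age (years)", ylab = "Friend count")
```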


Figure 4. The top graph represents friend count as a function of age in months, with the blue line representing the mean. The middle graph represents friend count as a function of age, with the blue line representing the mean. The bottom graph represents friend count vs. age in months rounded, multiplied, and divided by 5.

Patience is the mother of all virtues

Learning R was definitely a challenge. Commands that should work in theory sometimes did not. As a new user, it was difficult to know exactly what had gone wrong. Fortunately, I had the guidance of Dr. Pennings, who helped me through the process. I also looked for resources outside of Udacity. One great package to use along with R is “swirl,” which is a teaching package. With swirl, I learned commands not taught in the Udacity course. It has multiple lessons that give the user immediate feedback. Patience and persistence are key to learning R. Now that I have seen what R can do, I know it was worth learning.

The possibilities are endless

My favorite feature of R is that the code used in a previous analysis can be saved and reused. R users can also share pieces of code with one another, which helps spread knowledge among users. If changes need to be made in the middle of an analysis, this is rather simple, and there is no need to redo the whole analysis by hand. R can be used to study many different types of data of any size or background. Scientists such as Dr. Pennings make major findings in Biology using R.

Although new to R, I was able to begin the analysis of my own data set [1] within only a few months of learning about it. Below is a figure which resulted from the question: Which HIV regimens are most common and in what years? In order to answer this question, many hours of work were invested in preparing the data set: excluding undesired data points, subsetting, color coding, etc. I ended up with 6255 HIV data points, which included only the 26 most common unique regimens as a function of time. The graph represents the most common regimens of HIV treatments taken by patients in different years. It is organized in order of increasing number of drugs per regimen. Each regimen was color coded according to whether it includes an NNRTI drug, includes a PI drug, or consists only of nRTIs.

Figure 5. The graph represents the most common regimens of HIV treatments taken by patients in different years belonging either to NNRTI, nRTI, or PI.
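
The data preparation described above (keeping the most common regimens, counting drugs per regimen, and assigning a class for color coding) could be sketched roughly as follows. The records here are invented for illustration and are not from the real data set. The `regimen` format with `+` between drugs, the `has_drug` helper, and the cutoff of three regimens are all assumptions of this sketch.

```r
# Made-up records: one row per patient-year, drugs joined with "+".
records <- data.frame(
  year    = c(1990, 1991, 1997, 1998, 2003, 2005, 2010, 2003, 2006, 2010),
  regimen = c("AZT", "AZT", "AZT+3TC",
              "AZT+3TC+NVP", "AZT+3TC+NVP", "AZT+3TC+NVP", "AZT+3TC+NVP",
              "3TC+D4T+LPV+RTV", "3TC+D4T+LPV+RTV", "3TC+D4T+LPV+RTV"),
  stringsAsFactors = FALSE
)

# Keep only the most common regimens (the post kept 26; this toy data keeps 3).
counts <- sort(table(records$regimen), decreasing = TRUE)
top <- names(counts)[seq_len(min(3, length(counts)))]
common <- records[records$regimen %in% top, ]

# Number of drugs per regimen, used to order the plot.
common$n_drugs <- lengths(strsplit(common$regimen, "+", fixed = TRUE))

# Class label: NNRTI if an NNRTI is present, else PI, else nRTIs only.
has_drug <- function(regimen, drugs) {
  sapply(strsplit(regimen, "+", fixed = TRUE), function(d) any(d %in% drugs))
}
common$class <- ifelse(has_drug(common$regimen, c("NVP", "EFV")), "NNRTI",
                ifelse(has_drug(common$regimen, c("LPV", "RTV")), "PI", "nRTI"))

# Count how often each regimen was taken in each year.
table(common$regimen, common$year)
```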

As the graph shows, in 1989 and the early 1990s, HIV treatment consisted of the single drug AZT, and later, in 1997, NVP. As the years progressed, regimens composed of two drugs became more common. It isn’t until 1996 that we begin to see regimens composed of three drugs. Regimens composed of three drugs are the most abundant and continue to be taken by patients up to 2013, while the single-drug treatments seem to have ceased in 2008. In 2002, we first observe regimens composed of four drugs (although RTV is often not counted as a drug, so these regimens may be considered 3-drug regimens as well), which also continue to be used along with the three-drug regimens.

R is a great program for data analysis. I believe that anyone who would like to learn it can definitely do so with persistence. I will continue learning R and analyzing my data set. I hope to use it as a tool for future investigations in my career.

[1] Thanks to Dr Robert Shafer from Stanford University for sharing the data with us!