Scientist Spotlight: Alennie Roldan

7 Jun
Alennie (they/them) graduated from SFSU in 2021 and will be working as a Bioinformatics Programmer in the lab of Dr. Marina Sirota.

Pleuni: Hi Alennie, congratulations on graduating this semester! 

Alennie: Thank you! I really enjoyed my time at SFSU and I’m excited to move onto the next chapter. 

Pleuni: You told me that you are starting a job at UCSF soon. Would you mind telling me what you’ll be doing there and how you found that job? 

Alennie: I’ll be working as a Bioinformatics Programmer in the lab of Dr. Marina Sirota. The work is very much in line with the interdisciplinary concepts I learned through the PINC program: coding meets life science and health data. Prior to getting the position, I heard about an event, “NIH Diversity Supplement Virtual Matchmaking,” through the PINC and SEO mailing lists. At the event, I met with many different UCSF PIs and learned about their research. I kept in contact with some of the PIs whose research I thought was very interesting. From there I scheduled meetings and interviews with each PI to see if we’d be a good match. I ended up moving forward with the Sirota lab because I wanted to be involved in their research and felt that I could learn a lot from the experience. 

Pleuni: When did you start to learn coding? 

Alennie: Honestly, I feel like my first stint with coding began with Tumblr. In middle and high school I picked up some HTML to personalize my Tumblr page. It was exciting to input strange strings of numbers and letters and churn out wacky graphics. After I stopped using Tumblr, I didn’t seriously pick up coding again until summer 2019, for the BDSP, where I learned that there were so many different ways programming could be used. 

Pleuni: Did you always want to learn coding? 

Alennie: When I was younger, I’d watch the crime show “Criminal Minds” with my mother. One of my favorite characters was Penelope Garcia, the show’s FBI Technical Analyst. She fills the tech-savvy role of the group, and I always enjoyed seeing how she’d help solve the case by unlocking “digital secrets” or finding classified information. Based on portrayals like that, I always considered coding an exclusive skill limited to cybersecurity and creating complex software. So I was always interested in coding, but the idea of learning how seemed too daunting. 

Pleuni: You did the entire PINC program – which part did you like most? Which part was frustrating? 

Alennie: I enjoyed the creative freedom of the PINC program. Many of the classes I took had final projects that encouraged us to come up with our own ideas. It was satisfying and challenging to take all that I’ve learned so far and use that knowledge to come up with my own projects. One of my favorite projects was for CSC 307: Machine Learning for Life Science Data Scientists. The goal of my group’s project was to address the lack of diversity in dermatology datasets by applying a machine-learning model that could identify various skin disorders; our dataset consisted of skin image samples from People of Color. The assignment was especially rewarding because it allowed me to combine my passion for health equity, social justice, and programming into a single project. 

The most frustrating part of the program was primarily due to the pandemic. It was difficult to communicate with my professors and classmates through a remote format. The experience sometimes felt isolating because I had been so used to seeing my mentors in-person or meeting up with classmates to work on an assignment/project. Thankfully, I had met many of the same classmates in person before switching to virtual learning so I felt like I had some familiar faces to interact with. 

Pleuni: Sometimes it looks like coding is something for only some kinds of people. There are a lot of stereotypes associated with coding. How do you feel about that? 

Alennie: This is a very good question, as there are many layers to the coder/programmer stereotype. If you were to ask people to draw a picture of a coder, the most common image you’d likely see is a lonely man furiously typing in a darkened room, hunched over in his chair and focused on screens covered with indecipherable numbers and symbols. Simply put, we often imagine a typical coder as a cisgender white man who typically exhibits loner or awkward behaviors. It’s a very narrow and negative stereotype which ultimately promotes negative connotations regarding neurodivergent individuals and excludes Women and People of Color from the narrative. 

The stereotype does little to encourage or welcome most people. But in reality, the coding community at large desperately needs a diverse range of people who can contribute their unique perspectives. Stereotypes can be discouraging and unwelcoming, so it’s important for institutions to emphasize inclusivity to show how students can be fantastic coders and still be true to their unique identities. 

“…it’s important for institutions to emphasize inclusivity to show how students can be fantastic coders and still be true to their unique identities.”

Pleuni: I know you are applying to medical school. Do you think it is useful for a doctor to know about computer science? 

Alennie: I believe that computer science can be very useful to a physician because it can improve how they take care of people. Since they are face-to-face with patients every day, healthcare professionals are in a position to recognize and understand what unique problems need to be addressed in their communities. 

For example, with some knowledge of computer science a doctor could aid in the design of an app that patients can use to let them know if they’re experiencing side effects from their medication, create a website that shows local doctors who are LGBTQ+ friendly, or even better navigate electronic health records. The possibilities are endless! 

Pleuni: Do you have any tips for students who are just starting out? 

Alennie: Embrace your creativity! We often think of coding as a sterile and strict subject, but as you create new programs, websites, apps, etc., you realize how much creative freedom you actually have. Learning how to code can be very daunting, so when you personalize programs to fit your style or reflect things that you like, it makes the journey seem less scary and more fun. When I started coding, I had the most bare-bones of tools at my disposal, but I could still find ways to inject things to make my code feel like it belonged to me. The very first game I programmed, a basic recreation of Pong, I signed with my favorite color, pastel pink.

Alennie recreated the classic game of Pong with a little extra flair for one of their coding projects.

Pleuni: Thank you, Alennie! Please stay in touch!

Scientist Spotlight: Berenice Chavez Rojas

28 May

Berenice Chavez Rojas graduated from SFSU in 2021 with a major in biology and a minor in computing applications. She is moving to Boston to work in a lab at Harvard Medical School.

Pleuni: Hi Berenice, congratulations on graduating this semester! 
I know that you are starting a job at Harvard soon. Would you mind telling me what you’ll be doing there and how you found that job? Did your coding skills help you land this job?

Berenice: I’ll be working as a research assistant in a wet lab. The model organism is C. elegans and the project will focus on apical-basal polarity in neurons and glia. I found this job on Twitter! Having a science Twitter is a great way to find research and job opportunities as well as learn new science from other scientists. While I won’t be using my computational skills as part of this job, the research experience I have been able to obtain with my coding skills did help me. 

“coding always seemed intimidating and unattainable”

Pleuni: When did you start to learn coding? 

Berenice: I started coding after I was accepted to the Big Data Summer Program two years ago [Note from Pleuni: this is now the PINC Summer Program]. This was also my first exposure to research and I’m grateful I was given this opportunity. It really changed my experience here at SFSU and gave me many new opportunities that I don’t think I would have gotten had I not started coding. Following the Big Data Summer Program I started working in Dr. Rori Rohlfs’ computational biology lab. I also received a fellowship [] which allowed me to stop working my retail job; this gave me more time to focus on school and research. 

Pleuni: Did you always want to learn coding?

Berenice: Not at all, coding always seemed intimidating and unattainable. After my first exposure to coding, I still thought it was intimidating and I was slightly hesitant in taking CS classes. Once I started taking classes and the more I practiced everything began to make more sense. I also realized that Google and StackOverflow were great resources that I could access at any time. To this day, I still struggle and sometimes feel like I can’t make any progress on my code, but I remind myself that I’ve struggled many times before and I was able to persevere all those times. It just takes time!

The forensic genetics team at the Big Data Science Program in the summer of 2019. Berenice Chavez Rojas is in the middle.

“At the end of this project, I was able to see how much I had learned and accomplished”

Pleuni: You did the entire PINC program – which part did you like most? Which part was frustrating?

Berenice: My favorite part of the PINC program was working on a capstone project of our choice. At the end of this project, I was able to see how much I had learned and accomplished as part of the PINC program and it was a great, rewarding feeling. As with any project, our team goals changed as we made progress and as we faced new obstacles in our code. Despite taking many redirections, we made great progress and learned so much about coding, working in teams, time management, and writing scientific proposals/reports.

Link to a short video Berenice made about her capstone project:

Pleuni: Sometimes it looks like coding is something for only some kinds of people. There are a lot of stereotypes associated with coding. How do you feel about that? 

Berenice: I think computer science is seen as a male-dominated field and this makes it a lot more intimidating and may even push people away. The PINC program does a great job of creating a welcoming and accepting environment for everyone. As a minority myself, this type of environment made me feel safe and I felt like I actually belonged to a community. Programs like PINC that strive to get more students into coding are a great way to encourage students that might be nervous about taking CS classes due to stereotypes associated with such classes. 

“talking to classmates […] was really helpful”

Pleuni: Do you have any tips for students who are just starting out?

Berenice: You can do it! It is challenging to learn how to code and at times you will want to give up, but you can absolutely do it. The PINC instructors and your classmates are always willing to help you. I found that talking to classmates and making a Slack channel where we could all communicate was really helpful. We would post any questions we had and anyone could help out; oftentimes more than a few people had the same question. Since this past year was online, we would meet over Zoom if we were having trouble with homework and go over code together. Online resources such as W3Schools, YouTube tutorials, and GeeksforGeeks helped me so much. Lastly, don’t bring yourself down when you’re struggling. You’ve come so far; you can and will accomplish many great things!

Pleuni: What’s your dog’s name and will it come with you to Boston?

Berenice: His name is Bowie and he’ll be staying with my family here in California. 

Pleuni: Final question. Python or R?

Berenice: I like Python, mostly because it’s the one I use the most. 

Pleuni: Thank you, Berenice! Please stay in touch!

SFSU bio and chem Master’s students do machine learning and scicomm

20 May

This semester (spring 2021) I taught a new class together with my colleagues Dax Ovid and Rori Rohlfs: Exploratory Data Science for Scientists. This class is part of our new GOLD program through which Master’s students can earn a certificate in Data Science for Biology and Chemistry (link). We were happily surprised when 38 students signed up for the class! 

In the last few weeks of the class I taught some machine learning, and as their final project, students had to find their own images to do image classification with a convolutional neural network. Then they had to communicate their science to a wide audience through a blog, a video, or Twitter. Here are the results! I am very proud 🙂

If you are interested in the materials we used, let me know.


Two teams made videos about their final project: 

Anjum Gujral, Jan Mikhale Cajulao, Carlos Guzman and Cillian Variot classified flowers and trees. 

Ryan Acbay, Xavier Plasencia, Ramon Rodriguez and Amanda Verzosa looked at Asian and African elephants. 


Three teams decided to use Twitter to share their results: 

Jacob Gorneau, Pooneh Kalhori, Ariana Nagainis, Natassja Punak and Rachel Quock looked at male and female moths. 

Joshua Vargas Luna, Tatiana Marrone, Roberto (Jose) Rodrigues and Ale (Patricia) Castruita and Dacia Flores classified sand dollars. 

Jessica Magana, Casey Mitchell and Zachary Pope found cats and dogs. 


Finally, four teams wrote blogs about their projects: 

Adrian Barrera-Velasquez, Rudolph Cheong, Huy Do and Joel Martinez studied bagels and donuts. 

Jeremiah Ets-Hokin, Carmen Le, Saul Gamboa Peinada and Rebecca Salcedo were excited about dogs! 

Teagan Bullock, Joaquin Magana, Austin Sanchez and Michael Ward worked with memes. 

Musette Caldera, Lorenzo Mena and Ana Rodriguez Vega classified trees and flowers.

Using a Convolutional Neural Net to differentiate Bagels from Donuts

16 May

Article by: Adrian Barrera-Velasquez, Rudy Cheong, Joel Martinez, Huy Do

Why Bagels and Donuts?

Our group was initially torn on what to use for our classification assignment, but we ended up deciding we wanted to do something fun outside of the usual science data/image sets, given that we had been working with those all semester. The initial suggestion was McDonald’s vs. Burger King’s chicken nuggets, but that seemed like it wouldn’t work too well. Keeping with the food theme, however, we decided on donuts vs. bagels, which is actually an interesting pair to compare. Morphologically, these two items are very similar, but as foods they are very different. We as humans can tell the difference between donuts and bagels pretty easily, so it was interesting to see if this difference was enough for our neural net.

Nature of the Image Sets

As we mentioned, donuts and bagels are very similar in terms of morphology but have a very clear distinction when it comes to food. As such, they are presented differently, and we can see this even in our image set. We acquired our images by writing a Python script that automatically downloaded Google Image search results for donuts and bagels along with their links. From a cursory glance we can see that both items are usually displayed as multiples, but one of the biggest differences is that the donuts are more colorful. In addition, the bagels are often presented as sandwiches with things like cream cheese and smoked salmon. There is variety within each set of images, but we felt this made it more exciting to see how well the neural net performed.
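The post does not include the download script itself, but the downloading step could look roughly like the sketch below, assuming the image URLs from the search results have already been collected into a list (the function name and file-naming scheme here are illustrative, not the authors' actual code):

```python
from pathlib import Path
from urllib.request import urlopen

def download_images(urls, out_dir):
    """Download each URL into out_dir, naming files by index.

    A hypothetical helper: the original script scraped Google Image
    search results; here we assume the URLs are already collected.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, url in enumerate(urls):
        # Keep the original extension if the URL has one, else default to .jpg
        ext = Path(url.split("?")[0]).suffix or ".jpg"
        dest = out / f"image_{i:03d}{ext}"
        with urlopen(url) as resp, open(dest, "wb") as fh:
            fh.write(resp.read())
        written.append(dest)
    return written
```

A few hundred images per class collected this way is typically enough for the transfer-learning approach described later in the post.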

What is VGG16?

Convolutional networks have made it easier than ever to conduct large-scale image and video recognition. In particular, the VGG16 convolutional neural network has demonstrated superior recognition capabilities compared to other convolutional neural networks because of its network architecture. By using small 3 × 3 convolution filters in every layer, the overall depth of the network is increased. This increase in depth is what ultimately allows VGG16 to achieve very high accuracy in classification and localization tasks.
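The uniform 3 × 3 filter design is easy to confirm with Keras (assuming TensorFlow is installed; `weights=None` builds the architecture only, skipping the large pretrained-weight download):

```python
# Confirm that every convolutional layer in VGG16 uses 3x3 filters.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Conv2D

model = VGG16(weights=None)  # architecture only; weights="imagenet" fetches pretrained weights
conv_layers = [l for l in model.layers if isinstance(l, Conv2D)]
assert all(l.kernel_size == (3, 3) for l in conv_layers)
print(f"{len(conv_layers)} convolutional layers, all with 3x3 kernels")
```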


The VGG16 neural network returned accurate results in classifying the labels of the 10 tested bagel images and 10 tested donut images. The proportion of images classified correctly was 1.0, indicating perfect accuracy. The confusion matrix illustrates this performance: zero true bagels were misclassified as donuts (bottom left quadrant), and zero true donuts were misclassified as bagels (top right quadrant). 
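For readers who want to reproduce this kind of summary, the confusion matrix and accuracy can be computed in a few lines with scikit-learn; the label values below simply mirror the perfect result reported here:

```python
from sklearn.metrics import confusion_matrix

# True and predicted labels for the 20 test images (0 = bagel, 1 = donut).
# These values mirror the post's result: every image classified correctly.
y_true = [0] * 10 + [1] * 10
y_pred = [0] * 10 + [1] * 10

cm = confusion_matrix(y_true, y_pred)
print(cm)                           # [[10  0]
                                    #  [ 0 10]]
accuracy = cm.trace() / cm.sum()    # correct predictions / all predictions
print(accuracy)                     # 1.0
```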

The compositions of the tested bagel images vary widely along parameters such as a single bagel versus an ensemble, differing profile angles, and the presence or absence of fillings or cream cheese spreads. Regardless of this variety, VGG16 predicted the true labels of the bagel images with perfect accuracy (bottom right quadrant). The following table shows the set of 10 tested bagel images:

The compositions of the tested donut images also present a wide variance along several parameters and VGG16 predicted the true labels of the donut images with perfect accuracy (top left quadrant). The following table shows the set of 10 tested donut images: 

It is interesting to note that VGG16 accurately labeled bagel and donut image pairs that lack any major salient features useful in classifying one image as clearly bagel and the other as clearly donut. Such a pair is shown here:

The ability to make such a distinction with a minimum of distinguishing features is indicative of the power of the VGG16 neural network for image classification. 


The neural net performed so well, in fact, that we were left wondering whether it found a very simple method of classifying these images. As humans, we thought the colors and toppings were an immediate dead giveaway, so we suspect it might be relying on color-space separation or some kind of edge density on the surface depicting texture. Unfortunately we cannot peer into the black box to see, but this was nonetheless a very satisfying project and result.

Retraining the VGG16 Neural Network for Meme Classification

16 May


The purpose of this project was to create and train a neural network model (copied from the VGG16 model created by Oxford scientists) to recognize and label images based on the “meme” category that they belong to. The two memes we chose were “Pacha” and “Doge,” though this code could be adapted to include any type of meme the user wishes to classify. This project was accomplished by adapting code provided by Pleuni Pennings, Ilmi Yoon, and Ana Caballero H. All code was written in Python and executed using Google Colaboratory.

Image Preparation

To begin, we collected 80 sample images from a simple Google Images search, including 40 of each meme type, saving them in .pdf format to Google Drive. Images were then separated into training (n=40), validation (n=20), and testing (n=20) data sets by resizing the images and copying a subset of them to the respective folders. Once this image preparation step was complete, we had our data structured appropriately for input into the VGG16 model.

Creating the Model

Our image data sets were imported from Google Drive into the model provided by Dr. Ilmi Yoon, and all images were plotted with labels using the numpy and pyplot libraries. We began with a pretrained VGG16 (Oxford) model, which was downloaded using Keras. This model was copied directly (excluding the final output layer) to create a new model for our purposes, in which all layers other than the final output layer were set to be non-trainable. The final output layer of the neural network was then replaced with a new output layer specified for our unique meme classification task, containing two outputs corresponding to the two possible data classes. The output of our model is a probability distribution for each image across these two classes, representing the probability that the image belongs to either class.
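A minimal sketch of this model-surgery step in Keras (the function name is illustrative; `weights=None` can be used to skip the pretrained-weight download while experimenting, whereas the project described here would use `weights="imagenet"`):

```python
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense

def build_meme_classifier(num_classes=2, weights="imagenet"):
    """Copy VGG16 up to its last hidden layer, freeze those layers,
    and attach a fresh softmax output for our own classes."""
    base = VGG16(weights=weights)
    # Reuse everything except the original 1000-class output layer...
    x = base.layers[-2].output
    # ...and freeze it, so only the new output layer will be trained.
    for layer in base.layers:
        layer.trainable = False
    outputs = Dense(num_classes, activation="softmax")(x)
    return Model(inputs=base.input, outputs=outputs)
```

Because only the final layer's weights are updated, training is feasible even with a small image set like the 80 memes used here.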

Training the Model

The images within the training data set were passed individually through this model, a value of the cost function was calculated for each output, and the back-propagation algorithm was then applied to minimize this cost. This is done by adjusting the weights of the connections between the final hidden layer and the output layer, the only layer we had specified to be “trainable” when creating our model. This process was repeated by cycling through the entire training data set ten times, i.e., ten training epochs. To assess whether this number of training epochs achieved the ideal balance between specificity and generality, neither underfitting nor overfitting the model to the data, the cost on the training data set was compared to that on the validation data set.

Testing the Model

Finally, our model was tested by running the “testing” data set, containing images excluded from the training and validation data sets, through the model. The prediction accuracy of our model was calculated as the percentage of “testing” images classified correctly within the meme category that they truly belong to. Plotting these results using a confusion matrix revealed that our model was able to classify the images and distinguish between the two meme categories with 100% accuracy.

This project was completed as our final for the Exploratory Data Science for Scientists (EDSS) course, a component of the SFSU Graduate Opportunities to Learn Data Science (GOLD) program.

EDSS Team #1 – Joaquín Magaña, Michael Ward, Teagan Bullock, Austin Sanchez


How We Used Machine Learning to Classify Images of Dogs and Applications in Biology Research

16 May

By Rebecca Salcedo, Carmen Le, Jeremiah Ets-Hokin, and Saul Gamboa

What can we do with Machine Learning?

Machine learning with images is a very powerful tool that can aid in analyzing large data sets. In biology, there are many different applications for machine learning. 

Examples of Applying Machine Learning 

Identifying mutant embryos

One application is the ability to identify mutant embryos. Here are two images of frog embryos, one with a mutation in muscle development and one that is “normal” (photos courtesy of Dr. Julio Ramirez from Dr. Carmen Domingo’s group at San Francisco State University). You can see how similar these images are and how classifying them manually would be both difficult and time consuming! This is a perfect place where machine learning and image classification become a great asset to developmental biology research. 



Coastal ecology/Marine biology research

Machine learning can also be useful in the world of marine ecology. One example is quantifying the abundance of different organisms underwater. Traditionally this would be done by SCUBA diving with a clipboard, measuring tape, and some sort of quadrat. You would individually count the abundance of whatever you are interested in and use that measurement as a subsample of the larger area. The two major limitations of this method are, one, that there is a lot of room for human error, and two, that SCUBA diving is limited to a very narrow window of time and hand surveying takes a long time. A method many researchers are moving to is taking images of underwater areas. The limitation there is that they end up with huge numbers of images that need to be analyzed. This is where machine learning comes in: you can write code that goes through huge numbers of images and classifies types of organisms. This method allows for larger and higher-resolution data sets for an underwater world that is so hard to see.

Our project 

As a group we wanted to test out how accurate machine learning with images can be. For this project we decided to see if the machine could identify whether a dog exists in an image. We noticed that Google Photos also has image recognition built in. However, it makes some errors: a sea lion or an alligator sometimes comes up when searching for images of dogs. So we wanted to see if we could code an even more accurate image recognizer. Here are some of the images, with and without dogs, that we wanted to see if a computer could tell apart. 

Images with Dogs:


Images without dogs:

About the code

To recognize images of dogs, we could build on code that already existed. For our project we decided to use Oxford’s machine learning code, a network called VGG16 that is able to conduct image recognition. You can read about it here.  

We used VGG16 as a starting point and trained it to recognize the images associated with our project. We had help from professors at San Francisco State University, Ilmi Yoon and Pleuni Pennings, who assisted in adjusting the code to obtain our image dataset, output the results, and analyze the accuracy of the training done with VGG16. We also had some help from Twitter user @Ana_Caballero_H, thanks to a blog post showing how to split the images into the multiple folders.  

Our changes to the code

Even though an image recognition model already existed, we couldn’t use it directly without some minor modifications to both the code and our data. There were two main steps to prep our data. First was some data wrangling so we could train the machine with our existing images. In order to train the machine, you have to split your images into three different groups: training, test, and validation. Training images provide examples, and the test and validation images allow the model to try out its skills, see if it’s right, and fine-tune its decision making. 
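A sketch of what that splitting step might look like (the folder names follow the three groups above; the exact fractions, random seed, and copy-rather-than-move choice are assumptions, not the authors' actual script):

```python
import random
import shutil
from pathlib import Path

def split_images(source_dir, dest_dir, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle the images in source_dir and copy them into
    train/validation/test subfolders of dest_dir.

    Returns a dict of group name -> number of images placed there.
    """
    files = sorted(Path(source_dir).iterdir())
    random.Random(seed).shuffle(files)  # fixed seed keeps the split reproducible
    n_train = int(fractions[0] * len(files))
    n_val = int(fractions[1] * len(files))
    groups = {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
    for name, members in groups.items():
        subdir = Path(dest_dir) / name
        subdir.mkdir(parents=True, exist_ok=True)
        for f in members:
            shutil.copy(f, subdir / f.name)
    return {name: len(members) for name, members in groups.items()}
```

Shuffling before splitting matters: without it, images sorted by filename (e.g. all dogs first) could leave one group with only a single class.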

While working to split our images, we encountered a number of errors. Eventually we realized that all of the images had to be in the same format. Even though all the images came from our cell phones, they were in a variety of different formats (JPEG, PNG, HEIC, and more). To fix this we made sure to export the images in the same format. Eventually, we were able to successfully split our images!
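The export step can also be scripted with Pillow; here is a sketch (note that HEIC inputs would additionally require a plugin such as pillow-heif, which is an assumption beyond what the post describes):

```python
from pathlib import Path
from PIL import Image

def convert_all_to_jpeg(image_dir):
    """Re-save every image in image_dir as .jpg so the whole set
    shares one format. Returns the list of newly written files."""
    converted = []
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() == ".jpg":
            continue  # already in the target format
        img = Image.open(path).convert("RGB")  # JPEG has no alpha channel
        dest = path.with_suffix(".jpg")
        img.save(dest, "JPEG")
        converted.append(dest)
    return converted
```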

Additional changes to the code were made. We had to make sure that the two options it was deciding between were labeled “dogs” and “no dogs”. Importantly, this has to match the name of the folders we used to initially group the images the model was trained on.


Once all of that was done, our model was ready to go! But does it work?

Is our Machine ‘Learning’?

Our results suggest that our model is overfitting our data, i.e. is too specific! 

After the code was optimized and debugged, the pre-trained VGG16 model was imported in order to create a new model where all layers (except the output layer) were copied from the VGG16 model. This new model was trained to detect one of two conditions, “Dogs” or “No Dogs.” Next, the model was compiled so that we could assess its performance; for this, we used “categorical_crossentropy” as the “loss” parameter. In this setup, after the training data are evaluated, the model is adjusted to minimize the loss. We indicated 10 epochs, the number of times to go through the whole training set. After running the model, we compared validation loss to training loss.
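The compile-and-fit steps look roughly like this in Keras. This toy version swaps the VGG16 pipeline for synthetic features and a single dense layer so it runs in seconds, but the `categorical_crossentropy` loss, the 10 epochs, and the training-vs-validation-loss comparison match the steps described above:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

# Toy stand-in for the real image pipeline: random "features" and
# one-hot labels for two classes ("Dogs" / "No Dogs").
rng = np.random.default_rng(0)
x_train, x_val = rng.normal(size=(40, 8)), rng.normal(size=(20, 8))
y_train = np.eye(2)[rng.integers(0, 2, 40)]
y_val = np.eye(2)[rng.integers(0, 2, 20)]

model = Sequential([Input(shape=(8,)), Dense(2, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train, epochs=10,
                    validation_data=(x_val, y_val), verbose=0)

# Validation loss well above training loss would suggest overfitting.
print(history.history["loss"][-1], history.history["val_loss"][-1])
```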

Our results indicate a much higher validation loss than training loss, which is a clear sign that our model is overfitting. Next, we completed a final test of the model in which the percentage of images classified correctly was determined and a confusion matrix was produced. Our accuracy was 100%, and this number was supported by the confusion matrix, which shows the subset of 30 images classified as 17 true positives (Dogs) and the rest as true negatives (No Dogs). 

Lastly, we completed our project by visualizing the true and predicted labels of our images to visually assess how we did.

And that’s how we made our model! If you’d like to take a look at our code or the images we used, check out our repository on GitHub.

This was all done as part of the GOLD program at SFSU, a wonderful program that provides graduate students the chance to develop their coding and data science skills.

Scientist spotlight: meet Dr Sabah Ul-Hasan!

28 Apr

Dr Ul-Hasan (they/them) is a postdoc and lecturer in bioinformatics under Dr Andrew Su and Dr Dawn Eastmond at Scripps Research, doing biocuration and automated data integration work within the Gene Wiki project of Wikidata. They received their PhD in Quantitative & Systems Biology from UC Merced, their Master’s in Biochemistry from the University of New Hampshire and their BSc degrees (3 majors! Biology, Chemistry, and Environmental & Sustainability Studies) from the University of Utah. Sabah is involved in what feels like a thousand different activities related to science, research, coding, outreach, conservation, environmental justice and other things. 

I got to know Sabah a couple of years ago when I visited UC Merced and then started following them on Twitter. One thing I really love about them is how they don’t limit themselves to just doing one thing. They are ambitious and radical. They founded the Biota project to connect underrepresented communities with nature. They are a filmmaker (see here)! They volunteer for The Carpentries, and they started the venom-microbiome research consortium. They organize workshops, speak at events, teach classes, and do many other things. 

In my opinion, too few scientists use their platform to fight for justice and to share their passion and knowledge. At the same time, many PhD students, postdocs, and even assistant professors are shy about taking a stance, thinking that they will speak up louder (about science or justice or both) when they are more senior. But Sabah proves that you don’t have to be a tenured professor to make a difference in science (they have more than 8,000 followers on Twitter, just saying). 

Pleuni: Hi Sabah, thanks for taking the time to answer my questions! Could you tell us in a few sentences how you became interested in data science? 

Sabah: One of my dissertation chapters involved data that was over 100 years old. I know this isn’t a new concept for anyone doing paleo research. I was also well-familiar with “old” data through all the climate change reports that have come up in the public over the years. 

However, to directly work with data like that I realized there were so many more questions I wanted to ask people from 100 years ago. That then got me wondering, “How can I contribute to research in a way that can be sustainable 20, 50, or even 5 years from now?”. 

My interest in data science thus came from a position of wanting to be part of something bigger in terms of the infrastructure for how we can sustain the science of today and tomorrow. 

Pleuni: How did you start learning coding skills? Was it hard for you to learn? 

Sabah: I was first introduced to R during my (Biochemistry) Master’s at the University of New Hampshire in 2013. I sat-in on a casual meeting among graduate students and postdocs and truly had no idea what anyone was talking about. 

The data analysis section of my MSc thesis ended up using Excel to make bar charts. In retrospect, I see how much faster I could’ve done the analyses if I had taken the time to learn coding. When I began the doctoral program at UC Merced in January 2015, I knew coding was a skill I wanted to learn, and so I did, through classes and workshops. 

Now it’s my job as a postdoctoral scholar and lecturer for bioinformatics, and I still sometimes struggle with basic concepts. The difference between then and now is I’m a lot better at admitting when I don’t know something, how to ask a question for what I need to learn, and where to go to find that answer. 

I’m not sure anyone who does bioinformatics considers themselves an expert, but perhaps the expertise lies within the ability to problem solve especially when it is difficult or can feel overwhelming. In sum, the sooner you can confront your fears the better! Don’t let them freeze you. Believe in your ability to constantly learn and grow, even when you’re a titled expert!

Pleuni: For your paper that appeared in Plos One in 2019, you studied the diversity of microorganisms (including archaea, bacteria and eukaryotes) in seawater and sediment in three different locations. It sounds like a complex dataset to work with. 

Community ecology across bacteria, archaea and microbial eukaryotes in the sediment and seawater of coastal Puerto Nuevo, Baja California

Sabah: It’s funny to only be two years out from that publication and already think of so many things I would’ve done differently. I guess that’s growth! 

I attribute a lot of credit and thanks to the co-authors of the paper and those in the acknowledgements. It came a long way from when I first drafted it to the final publication form, and posting it on bioRxiv also helped a great deal in soliciting feedback. 

What I think really makes a difference is the transparency of that research and the associated code, especially the data clean-up (which is the bulk of the analysis work, in my personal opinion). I’ve since received several inquiries from people about their own work, and it feels great to know the code can serve as something people can apply to their own research to make things a little easier. 

I also think it’s important we as scientists specify the microbes we’re investigating in any ‘microbial community’-type paper. Many of the amplicon and metagenomics studies I see really focus on bacteria or fungi, which is absolutely fine, but that isn’t the comprehensive microbial community that many of these papers’ titles tend to imply. In this study, too, we focus only on the microbial groups we identified through 16S and 18S. We need to be better at saying what the data are rather than wordsmithing for a nice story. That will help the next group build upon those gaps for something stronger next time, and overall our intent as scientists is always for research to keep advancing. Right? 

Pleuni: You used R for your data analysis (but also other software such as QIIME2). What do you like or not like about R? Could you imagine doing a paper like this one without R?

Sabah: Wrappers such as QIIME2 and mothur are great for people who want to do an analysis of a microbial dataset and then perhaps never touch one again. For me, I found myself continuously asking a lot of “Why?” and wanting to dig deeper into the fundamentals behind the software I was using. In the end, R took more time to learn in the short term, but it made more sense to me what was happening at each step of the analysis. It was also a good way to affirm my results: trying different avenues and seeing the same output. 

What I learned from putting together the paper is it’s not about finding the ‘right’ or ‘wrong’ answer, it’s about finding an answer that is logical and as unbiased as possible. A lot of the time we have these hypotheses we ‘prove’ through confirmation bias. To me, code (when done with intention) is a way to step outside of ourselves and see what the data is telling us rather than what we want the data to say — and that’s where the interesting science lives.

This publication, for example, wasn’t exactly what we wanted to see. It’s actually a failed attempt at sequencing the venom microbial community of Californiconus californicus, which was the focus of my dissertation (venom microbiomes), due to too much host contamination of the tissues we sampled from that region of Puerto Nuevo. So, what do we do? Do we call it all a wash? There was a lot of thought, time, and resources that went into that work. 

I had sampled the sediment and water of the area, along with some generic chemistry tests, to see if the venom microbial community was largely specialized to the snail venom glands or from the surrounding environment (they burrow in the sand). That data was still usable, had good replication, and we didn’t know anything about the microbial community of Puerto Nuevo before that point. Ah-ha! A different story than we were thinking, but still a valuable one. Let the data tell you, don’t misconstrue the data to fit your narrative. 

R, and all the programming languages I’ve learned thus far, have helped me learn that.

Pleuni: On your Twitter profile, you list many interests, such as advocacy, consulting, and data visualization. Can you tell us a bit about your different interests? Are these things linked to each other?

Sabah: Well… haha. The link is that, at heart, I’m a bit of a troublemaker. It’s the nature of a scientist to ask a lot of questions, and asking too many questions can often get us into trouble! I likewise enjoy being asked a lot of questions, and hope to always maintain humility in learning just as much from high school students as I do from tenured professors. 

I wanted my Twitter profile and bio to emulate that duality of being both a ‘credible academic’ and someone pushing back on what we define as ‘the norm’. I disagree with the idea that a science expert needs to possess a PhD (or some other form of higher-education certification), because of the privilege and whiteness involved, though I do also benefit from it after completing the process, and there is of course also danger in believing ‘just anyone’ on the internet. And I love learning and helping, which are really the only drivers behind all my many interests.

In my view, the most important quality in being a scientist is being approachable. If only a few people can understand the work you do, then what’s the point? That’s why I’m on Twitter, and also as a way to keep myself grounded, especially learning from moments of being called out (which does happen from time to time). I’d also say my family keeps me in check, as I’m one of the few with a science background. I have one cousin on my Mom’s side with a Ph.D. and that’s it for our extended family of over 100 people (South Asian families are big). Being a good scientist is just as much about humanity as it is about the basic research. I think only good things can come from staying tuned into the reality of the world around us, even though it can feel like a lot to balance.

Pleuni: Do you have any advice for the bio and chem Master’s students in my Data Science class? 

Sabah: My advice is to just go for it! 

This past Fall I taught a bioinformatics course to (mainly) graduate students and it was an adventure for all of us. It was my first time as a full instructor for a course (versus a teaching assistant), during COVID no less, and it was also the first time many students in the course were getting into bioinformatics. 

At the end, it was clear to me that student progress in the course wasn’t about who knew how much at the start but rather about showing up with enthusiasm and simply trying. That went both ways: for me as the instructor, giving the lectures my all, and for the students in their performance. And life happens! I had to cancel one of the days due to personal life things, and that’s okay. Be good to yourself when you need to, and also don’t hold yourself back. And be good to others, too. We really never know what someone else may be experiencing behind the scenes for them to be flaky or on edge, and the more we can find the good in each other, the better we can focus on doing the good science. 

On that note, I can’t express enough how much of a difference it’s made in my life to work for or alongside even just one considerate person. As they say, “You are what you eat.” My PhD co-advisors (Dr Tanja Woyke and Dr Clarissa Nobile) and my current PIs (Dr Su and Dr Eastmond) are truly outstanding people. They have so many stresses in their own careers and lives, and they still somehow show up with kindness and professionalism every day. And they believe in me to do good work, even when I’ve had a bad week (or month!). That trust goes such a long way when you’re underrepresented in your field and often used to being discouraged or having people expect very little of you. Being entrusted to teach a course at a renowned research institute directly out of my PhD, for instance, is a big reason I chose this position: I knew my voice was heard and respected. That’s been true throughout, and it makes it much easier to show up with my best foot forward even on the tough days.

Tying it all together: so many times I’ve gotten myself stuck because I see others who are ahead of me, doing better than me, and/or with access to more resources than me. One truth we can all agree upon is that life is unfair, and while hopefully it will become more equitable over time through our own efforts to create change, the fact is that life is still happening in the meantime. No one will help you as much as you can help yourself, and the moments where I’ve been able to just sit down and see something through are how I’ve realized, more and more, just how much more ability I have than I thought. You’re much more capable than you give yourself credit for! It’s super cheesy, but it’s very true. And feel free to reach out any time!

Pleuni: Thanks for answering my questions, Sabah! So much here that resonates with me, including one of the last things you said, that you realized that you have more ability than you thought. This happens to me too! As just one example, just over a year ago, I didn’t think I could learn Machine Learning, but now I am even teaching it. Not that I am suddenly an expert, but I can do it and it is no longer scary. 

I look forward to seeing all the science, art, and justice-related projects you will be doing in the future! 


Sabah Ul-Hasan Google Scholar profile 

Sabah Ul-Hasan, PhD Twitter Profile (@sabahzero)

This week I learned about ME/CFS

12 Nov

This week I learned about ME/CFS when I sent a 2-line email to my PhD advisor Joachim Hermisson. Me: “Hope you & your family are well. Could you send me your recent Nature Reviews Genetics paper?” His reply came quickly: “Sure, paper attached. We are not well. One of my daughters is very sick.”

In a few more back-and-forth emails, I learned that his daughter has very severe ME/CFS (myalgic encephalomyelitis/chronic fatigue syndrome). She is so sick that she is in bed in a dark room 24 hours a day. She is a teenager.

Let me repeat that. This 18-year-old girl is bedridden in a dark room for 24 hours a day.

When I last saw Joachim’s family in 2017 in Berkeley, she and her twin sister were healthy teenagers adjusting to school abroad.

It is thought that 20 million people worldwide have ME/CFS. It’s a chronic disease and there is no cure. Symptoms include extreme fatigue, post-exertional malaise (which means doing anything makes you more tired), severe headaches and light and sound sensitivity.

Okay, you may think. Many diseases are bad. It’s always horrible when a young person gets so sick. But hey, we can’t do much about it.

But here’s the thing. According to the CDC, this disease, while not exactly rare (probably 2 million people in the US and 2 million in Europe have it), is not even taught in most medical schools.

Also, ME/CFS gets almost no funding! This chart shows that almost all diseases get more research money than ME/CFS, despite its enormous impact in terms of lost “disability-adjusted life years”.

The funding situation for ME/CFS in Europe is probably worse, but nobody has the numbers.

If researchers have no money to work on this disease, no wonder that there is no diagnostic test and no FDA-approved medication.

Plus, ME/CFS is likely to get much more common with every COVID-19 wave because it is often triggered by a viral infection. Here is an article about the link between Covid-19 and ME/CFS.

After reading just a few things about ME/CFS I could see how difficult it must be to live with (or see a loved one live with) such a crippling disease that’s not even taken seriously by medical schools or funding agencies. So I asked Joachim “Is there anything I can do?” His answer was “Being informed is most important.” Later he added: “People can donate money or sign the EU petition”. Here’s a short list of action items:

4 things you can do to support people with ME/CFS

  1. First, you can read about ME/CFS. This interview has a lot of info and also tells the story of an impressive young woman (Evelien van den Brink) who petitioned the EU to spend more on ME/CFS research.

    Here you can see or read the speech Evelien van den Brink gave in the EU parliament (her speech starts at 2’10’’).
  2. Second, if you are an EU citizen, you can sign this petition to ask the EU to spend more money on ME/CFS research.
  3. Third, you can donate money for research here: or here
  4. Finally, please share this information so that more people will know about ME/CFS. 

Thanks for reading this thread / blogpost!

Note: I am sharing this story with Joachim’s permission.

How we run an inclusive & online coding program for biology and chem undergrads in 2020 

7 May

By: Nicole Adelstein, Pleuni Pennings, Rori Rohlfs

Coding summer program (BDSP) in 2018, when students were in the same room for 8 hours a week.

In 2018 this team (led by Chinomnso Okorie) met in the “yellow room” for 8 hours a week to learn R.  

We have been running combined coding/research summer programs for several years, with a focus on undergraduate students, women, and students from historically underrepresented racial and ethnic groups. This summer, we will run our 9-week program as an online program. We think that others may be interested in doing this too, so we’ll share here how we plan to do it. 

Some of the information below will also be published as a “ten rules paper” in Plos Computational Biology*, but we wanted to share this sooner and focus on doing things online vs in person. 

TL; DR version

  1. Have students work in teams of 4 or 5, for 2 hours per day, 4 days a week. Learning to code should be done part-time, even if your program is full time. 
  2. Use near-peer mentors to facilitate the team meetings (not to teach, but to facilitate). 
  3. Use existing online courses – we’ll share a few that we like. Don’t try to make your own curriculum last minute. There are good online courses available. 
  4. Give the students a simple (repeat: simple!) research project to work on together. 

1. Have students work in teams for two hours a day – with pre-set times. 

Learning to code is stressful and tiring. Even though many students may not have jobs this summer, that doesn’t mean they can code for 8 hours a day. First, because they have other things to do (like taking care of family members), and second, because there’s a limit to how long you can be an effective learner. 

Our program is 10 hours per week (8 hours of coding, 2 hours of “all-hands” meeting). We make it clear that no work is expected outside of these hours. For example, a team may meet from 10am to 12pm four days a week for coding. 

Check-ins, quiet working, shared problem solving. 

During the coding hours, the near-peer mentor is always present (on Zoom, of course!) and facilitates the meeting. The very first day should be all about introductions and expectations. After that, we suggest that every day, there is time for check-ins (everybody shares how they are doing, what they’re excited about or struggling with, or what music they’re listening to), quiet working (mute all microphones, set a timer, everybody works on the online class by themselves) and shared problem solving (for example, let’s talk about the assignment X from the online class). One of the mentors last year was successful with starting every meeting with a guided meditation. 

Each team has a faculty mentor in our program (this could be a postdoc or faculty member). Once a week, the faculty mentor joins the meeting for about 1 hour. This hour could consist of introductions / check-ins, a short presentation or story by the faculty mentor, and the opportunity for the team to ask questions. It’s great if the near-peer mentor and the team prepare questions beforehand. 

1B. Add a non-coding meeting (if you can/want)

In addition to the 8 coding hours per week, our students also meet for 2 hours per week in an “all hands meeting”. Such an all-hands meeting is not absolutely necessary, but if you have the bandwidth, it may be nice to meet once a week to do something other than coding. Maybe to read a paper together or meet with someone online (an alum who is now somewhere else? A faculty member or grad student?). 

If your program is full time (like an REU program), we suggest still doing only about 8-15 hours of coding per week. Fill up the rest with more standard things such as lectures, reading, etc. (and don’t make anyone do Zoom 40 hours a week!). If students are enjoying themselves with coding and getting more confident, they may do more coding by themselves, but in our program it is not the expectation. 

2. Mentors and teams are key 

When working alone, we’ve often seen students get stuck on technical problems, leaving many feeling lost and inadequate and wanting to discontinue learning this new skill. Working in a mentored team, however, students have access to immediate support from their peers and mentor. This helps them learn technical skills more efficiently, develop relationships with each other, and cultivate a shared sense of belonging in computational research (Kephart et al. 2008). We recommend that each participant in a coding summer program be assigned to a team of 4 to 5 students with similar technical skill levels led by a near-peer mentor. 

Mentors in our program are typically a year or two ahead of participants but belong to similar demographic groups and come from similar academic backgrounds. The mentor facilitates the meetings and leads the team in learning skills and applying them to a research question (without doing the work themselves). 

Each team also has a faculty advisor, who comes up with a research project that is likely to be completed in the available time and that is of interest to the students (Harackiewicz et al. 2008). The faculty advisor meets with the whole team at least once per week to guide learning and research. Of note, acting as a mentor improves students’ retention and success in STEM (Trujillo et al. 2015); therefore, this setup benefits mentors as well as mentees. 

2B. Who can be mentors? 

Over the years, we have found that near-peer mentors are incredibly useful for a number of reasons: 1) student participants are more likely to ask for help from a near-peer mentor than from a faculty advisor, 2) near-peer mentors serve as role models, giving participants an idea of what they can aim for in the next year or two, and 3) the use of mentors allows the program to serve many more participants than it could if it relied on a few time-pressed faculty advisors. Our selection criteria for mentors include essential knowledge (for example, the mentor for a team doing an advanced chemistry research project should have taken physical chemistry), mentoring experience or potential, logistical availability, and a demographic background similar to the participants’. Mentors don’t need experience with the specific coding language or research topic they will work on with their team. Rather than being the expert in the room, they are expected to help team members work together to find solutions or formulate questions for the faculty advisor. 

Mentors are crucial for the success of the program and need to be paid well for their work. Each week of the program, we pay our mentors a competitive wage for 8 contact hours with their team, a 2-hour all hands lunch meeting, a 2-hour mentor meeting, and 3-4 additional hours to account for preparation. However, we realize that this summer, things may be different for many! You may find that PhD students or Master’s students who can not work in the lab (but are still paid / on a fellowship) could be excellent near-peer mentors. Just make sure that the mentors know that this is a real commitment that will eat up a significant chunk of time each week. 

3. Identify an appropriate online course for each team

We have found that when learning basic coding skills, interactive online classes to learn computer programming (for example, from Datacamp, Udacity or Coursera) motivate and engage students better than books or online texts. Yet, when working individually, most students – especially beginners and historically underrepresented students – don’t finish online classes (Ihsen et al. 2013; Jordan 2015). As a solution, we have found that in teams, where students can work together and support each other, they learn a great deal from an online class. 

Each team’s faculty advisor picks a free, clearly structured online class with videos and assignments to teach participants coding skills. We have had good experiences with Udacity’s Exploratory Data Analysis course because this class is suitable for beginners. It does a good job motivating students to think about data and learn R. In early team meetings, participants spend time quietly working on the online class with their headphones on, followed by a team discussion or collaborative problem-solving session. If students encounter difficulty with any of the material, mentors may develop mini-lectures or create their own exercises to facilitate learning. Note, the students’ goal is not necessarily to finish the online course, but to learn enough to perform their research project. 

3B. Suggested classes:

Udacity Exploratory Data Analysis with R–ud651

CodeHS (the faculty mentor or the near-peer mentor needs to create a section on CodeHS; we use the introduction to Python (rainforest) course).  

Coursera (this one is a tip from our UCSF colleague Dr Kala Mehta)

4. Assign each team a simple and engaging research project 

Learning to code without a specific application in mind can feel boring and irrelevant, sometimes leading students to abandon the effort. In our summer program, teams carry out a research project to motivate them to learn coding skills, improve their sense of belonging in science (Jones, Barlow, and Villarejo 2010), and cultivate their teamwork and time/project management skills. Faculty advisors assign each team a research project early in the program. These projects should answer real questions so that participants feel their work is valuable (Woodin, Carter, and Fletcher 2017). The projects should also be relatively simple. Small and self-contained projects that can be completed within a three-week time frame are ideal to ensure completion and make participants feel that their efforts have been successful. For example, past research projects in our program, which reflect the interests of faculty advisors and the students, include writing computer simulations to model the evolution of gene expression, analyzing bee observations from a large citizen science project, examining trends in Google search term data with respect to teen birth outcomes, and building an app for finding parking spots on or near campus. 

For 2020, we’d like to encourage you to pick a project that appears extremely simple if you normally use R or Python to make your plots / do stats, but that would be quite challenging if you’re new to coding. We also suggest that – unless the students are already quite advanced – you don’t give them a project that you want to publish on quickly. Nobody needs more pressure this summer.  

Here are some suggestions for simple research projects

  1. Let students plot the number of COVID19 cases in their county over time using R. Let them plot the number of cases in 5 different counties on the same figure. Add an arrow for when a stay-at-home order was implemented or terminated. Easy to download data are here: 
  2. Let students keep track of how many steps they take each day for 10 days using their phone or watch. Let them plot the number of steps per day using R. Let them add a line for the mean. Collect data from 6 people and create a pdf with 6 plots in different colors. 
  3. If you have any data from your lab, let the students plot those data. Try making 4 different plots with the same data (scatter, box, histogram, etc). 
  4. Let students recreate an existing plot from a publication when the data are available. 
  5. Let students analyze (anonymized) data from your class. How strong is the correlation between midterm grades and final exam grades? Do students who hand in homework regularly do better on the test? 
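As a taste of what project 5 involves, here is a minimal sketch. It is written in Python for self-containedness, though the same few lines translate directly to R. The grade data and the `pearson_r` helper are made up for illustration; they are not from any real class:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical anonymized scores for six students
midterm = [55, 62, 70, 78, 85, 90]
final   = [58, 60, 72, 80, 83, 94]

print(round(pearson_r(midterm, final), 2))  # → 0.98
```

Even a tiny dataset like this gives students something real to discuss: is the correlation strong, and what might explain the students who deviate from the trend?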

* reference: Pleuni Pennings, Mayra M. Banuelos, Francisca L. Catalan, Victoria R. Caudill, Bozhidar Chakalov, Selena Hernandez, Jeanice Jones, Chinomnso Okorie, Sepideh Modrek, Rori Rohlfs, Nicole Adelstein Ten simple rules for an inclusive summer coding program for non-CS undergraduates, accepted for publication in Plos Computational Biology.

Why I don’t believe that 2.5-4% of people in Santa Clara county have had COVID19

19 Apr

Originally posted on April 19th. Small edits on April 21st. Thanks to Scott Roy and Dmitri Petrov for comments. 

This week a study was published by researchers from Stanford about how many people in Santa Clara county have been infected with the new coronavirus SARS-CoV-2.

You may not have heard of Santa Clara county, but it’s the heart of Silicon Valley and its most famous residents are Stanford, Google, Apple and IBM. I lived there too for a few years when I was a postdoc at Stanford.

The researchers used a new test to detect antibodies against the virus. Antibody testing is going to be super important in the near future, but I have serious concerns about this study and its conclusions.

The main result from the paper is that they estimate that between 2.5 – 4% of people in Santa Clara have had COVID19. That would mean between 48,000 and 81,000 people. If this is correct, it would mean that the virus has infected many more people in Santa Clara than the official numbers suggest. (50-85 fold more).

If 50-85 times sounds hard to believe, that’s because it is. Even though most experts agree that the real number of infected people is higher than the reported numbers, 50-85 fold higher than reported would be quite crazy. In research, we like to say that “extraordinary claims require extraordinary evidence.” Here the claim is extraordinary but the evidence isn’t. Also, we learn that even if a study comes from a great university, that is no guarantee that the study is good.

2.5-4% seroprevalence is unlikely in Santa Clara county

Why is 2.5-4% positive in Santa Clara county an extraordinary claim? This is because in the European countries where seroprevalence is around 3%, many more people have died (relative to the size of the population) than in Santa Clara county. It would be very unlikely that the infection fatality rate (how likely you are to die when you catch the virus) is significantly lower in Santa Clara than in other parts of the world. For example, The Netherlands also reports a 3% seroprevalence, but has 5.5 times as many deaths per 100,000 people compared to Santa Clara county.

Two issues: biased sample and false positives

In my opinion, there are two main issues with this study. Both lead the Stanford researchers to overestimate the number of people who were infected with SARS-CoV-2.

One is that this was probably NOT a random sample. And two is that the false positive rate for this kind of test is high. This means that we don’t know whether the people who had a positive test result have really been infected with the virus.

  1. Why is this not a random sample? 

They asked people to volunteer for this study using Facebook ads. Now, I think there is nothing wrong -in principle- with using Facebook ads to recruit people. But I do think that people who have been sick with a fever and cough recently are more likely to volunteer for this study to test whether they’ve had COVID19!

If people who actually had COVID19 were twice as likely to volunteer for the study, it would mean 2x as many positive tests in the sample and thus the conclusion that 2x as many people in the county of Santa Clara have had the disease.

This is why it is so important in statistics to have what we call “unbiased samples.”

  2. What is the “false positive rate” and why does it matter? 

Whenever you do an antibody test to see if someone has had a disease, you need to consider two kinds of mistakes that could happen. The test could come back negative even if someone had the disease – this is called a false negative – and the test could come back positive even if someone didn’t have the disease – this is called a false positive.

If a disease is rare (such as COVID19 in Santa Clara county), we need to worry mostly about the false positives. Using test data from the manufacturer, the authors estimate the specificity to be between 98.1% and 99.9%. (When they include their own data, the range becomes 98.3% – 99.9%.) This means that the false positive rate is somewhere between 0.1% and 1.9%. In other words, even if you test only people who have never had the disease, between 0.1% and 1.9% of them would still test positive.

What does all of this mean? 

Imagine we are testing 1000 people in an imaginary Santa Clara county.

Now imagine that 1% of the population has had COVID19. That would be around 10 people out of 1000. But, because people who were recently sick are more likely to volunteer for the study, maybe instead of 10, 20 people out of 1000 are positive. That’s 2% of the sample.

The other 98% of the sample should have a negative test. But, we know that the false positive rate of this test is between 0.1 and 1.9%, which means you’ll get another 1-19 people who test positive even if they never had the disease! Let’s assume for now that we get 10 false positives. Now we have in total 30 positive tests out of 1000 people tested. That could lead you to think that 3% of the sample of 1000 people has had COVID19 and thus 3% of Santa Clara county has had COVID19. Even though the real rate in our imaginary Santa Clara example was only 1%!
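The arithmetic in this thought experiment is easy to write out in a few lines of Python. All the numbers here (the 2x volunteer bias, the 1% false positive rate) are the illustrative assumptions from the example above, not measured values:

```python
# A worked version of the thought experiment: 1000 tests in an
# imaginary county where the true prevalence is 1%.
n_tested = 1000
true_prevalence = 0.01        # suppose 1% of the county really had COVID19
volunteer_bias = 2.0          # the recently-sick are 2x as likely to volunteer
false_positive_rate = 0.01    # assume 1%, inside the study's 0.1-1.9% range

# Fraction of the sample that truly had the disease, inflated by the bias
sample_positive_fraction = true_prevalence * volunteer_bias      # 2%
true_positives = n_tested * sample_positive_fraction             # 20 people

# Everyone else can still test positive by mistake
false_positives = n_tested * (1 - sample_positive_fraction) * false_positive_rate

apparent_prevalence = (true_positives + false_positives) / n_tested
print(f"apparent prevalence: {apparent_prevalence:.1%}")  # about 3x the true 1%
```

Changing `volunteer_bias` or `false_positive_rate` shows how sensitive the apparent prevalence is to these two unknowns, which is exactly the concern with the study.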

In the real Santa Clara study, 50 out of 3300 tests were positive (1.5%). In principle, these could all be false positives!

A lot of experts (here, here and here) are worried that the Stanford researchers have underestimated the false positive rate and have not corrected for their biased sample. And because they didn’t deal well with these two issues, they overestimate the percentage of people in Santa Clara who have had the disease.

How could this be done better?

  1. Get a more random sample. Dr Natalie Dean from University of Florida explains why household testing is the gold standard.
  2. Get a better sense of the false positive rate. Between 0.1% and 1.9% is too wide a range if the number you are trying to measure is likely in the same range.

Why are these numbers in Santa Clara important? 

Why does it matter so much whether 0.5, 1, or 3% of people in Santa Clara have had COVID19?

Well, as of today, 73 people have died of COVID19 in Santa Clara county. If that is 73 out of 40,000 – 80,000 infected – as the Stanford researchers suggest – then the chance of dying of COVID19 is relatively low (infection fatality rate 0.1-0.2%). But if that is 73 out of, say, 10,000 – 20,000 which is more realistic, the chance of dying from COVID19 is higher (infection fatality rate 0.3-0.7%).
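To make this comparison concrete, here is the same back-of-the-envelope arithmetic written out. The death count and infection ranges are the numbers from the paragraph above (as of April 2020); nothing here is a new estimate:

```python
# Back-of-the-envelope infection fatality rates for Santa Clara county.
deaths = 73  # COVID19 deaths reported in the county at the time of writing

def ifr_percent(deaths, infections):
    """Infection fatality rate, as a percentage of those infected."""
    return 100 * deaths / infections

# If the Stanford estimate (40,000 - 80,000 infections) were right:
stanford_range = (ifr_percent(deaths, 80_000), ifr_percent(deaths, 40_000))

# With a more realistic 10,000 - 20,000 infections:
realistic_range = (ifr_percent(deaths, 20_000), ifr_percent(deaths, 10_000))

print(f"Stanford-implied IFR:  {stanford_range[0]:.2f}% - {stanford_range[1]:.2f}%")
print(f"More realistic IFR:    {realistic_range[0]:.2f}% - {realistic_range[1]:.2f}%")
# roughly 0.1-0.2% versus roughly 0.4-0.7%
```

A three- to four-fold difference in the implied fatality rate is exactly why the denominator (how many people were really infected) matters so much.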

Because the Stanford researchers suggest that there is a ton of people in Santa Clara county who have had the disease and only relatively few who died, they suggest that the disease is maybe not so lethal. Others are taking these results and saying: “Stanford says it’s just like the flu, we can stop the lock-downs and open up the economy!”

Many public health experts think it is way too early to open up the economy and a lot more people will die if we do so now.

In fact, the reason that Santa Clara county has a very low number of people who have had the disease (probably around 1% or lower), is probably that Santa Clara county was one of the first counties in the country to issue a Stay-At-Home order and Stanford University (which is in Santa Clara county) was one of the first universities to close its campus. In many ways, Santa Clara county and Stanford have been an example in how to deal with this epidemic effectively.

I hope that if you read about more studies that use antibody tests, you read critically to determine whether their sample was random and how high the false positive rate is compared to the real positive rate they are trying to estimate.