Archive | May, 2021

Scientist Spotlight: Berenice Chavez Rojas

28 May

Berenice Chavez Rojas graduated from SFSU in 2021 with a major in biology and a minor in computing applications. She is moving to Boston to work in a lab at Harvard Medical School.

Pleuni: Hi Berenice, congratulations on graduating this semester! 
I know that you are starting a job at Harvard soon. Would you mind telling me what you’ll be doing there and how you found that job? Did your coding skills help you land this job?

Berenice: I’ll be working as a research assistant in a wet lab. The model organism is C. elegans and the project will focus on apical-basal polarity in neurons and glia. I found this job on Twitter! Having a science Twitter is a great way to find research and job opportunities as well as learn new science from other scientists. While I won’t be using my computational skills as part of this job, the research experience I have been able to obtain with my coding skills did help me. 

“coding always seemed intimidating and unattainable”

Pleuni: When did you start to learn coding? 

Berenice: I started coding after I was accepted to the Big Data Summer Program two years ago [Note from Pleuni: this is now the PINC Summer Program]. This was also my first exposure to research and I’m grateful I was given this opportunity. This opportunity really changed my experience here at SFSU and it gave me many new opportunities that I don’t think I would have gotten had I not started coding. Following the Big Data Summer Program I started working in Dr. Rori Rohlfs’ computational biology lab. I also received a fellowship [https://seo.sfsu.edu/] that allowed me to stop working my retail job, which gave me more time to focus on school and research.

Pleuni: Did you always want to learn coding?

Berenice: Not at all, coding always seemed intimidating and unattainable. After my first exposure to coding, I still thought it was intimidating and I was slightly hesitant about taking CS classes. Once I started taking classes and practicing more, everything began to make more sense. I also realized that Google and StackOverflow were great resources that I could access at any time. To this day, I still struggle and sometimes feel like I can’t make any progress on my code, but I remind myself that I’ve struggled many times before and was able to persevere all those times. It just takes time!

The forensic genetics team at the Big Data Science Program in the summer of 2019. Berenice Chavez Rojas is in the middle.

“At the end of this project, I was able to see how much I had learned and accomplished”

Pleuni: You did the entire PINC program – which part did you like most? Which part was frustrating?

Berenice: My favorite part of the PINC program was working on a capstone project of our choice. At the end of this project, I was able to see how much I had learned and accomplished as part of the PINC program and it was a great, rewarding feeling. As with any project, our team goals changed as we made progress and as we faced new obstacles in our code. Despite taking many redirections, we made great progress and learned so much about coding, working in teams, time management, and writing scientific proposals/reports.

Link to a short video Berenice made about her capstone project: https://www.powtoon.com/c/eKaZB3kkxE5/0/m

Pleuni: Sometimes it looks like coding is something for only some kinds of people. There are a lot of stereotypes associated with coding. How do you feel about that? 

Berenice: I think computer science is seen as a male-dominated field and this makes it a lot more intimidating and may even push people away. The PINC program does a great job of creating a welcoming and accepting environment for everyone. As a minority myself, this type of environment made me feel safe and I felt like I actually belonged to a community. Programs like PINC that strive to get more students into coding are a great way to encourage students that might be nervous about taking CS classes due to stereotypes associated with such classes. 

“talking to classmates […] was really helpful”

Pleuni: Do you have any tips for students who are just starting out?

Berenice: You can do it! It is challenging to learn how to code and at times you will want to give up, but you can absolutely do it. The PINC instructors and your classmates are always willing to help you. I found that talking to classmates and making a Slack channel where we could all communicate was really helpful. We would post any questions we had and anyone could help out, and oftentimes more than a few people had the same question. Since this past year was online, we would meet over Zoom if we were having trouble with homework and go over code together. Online resources such as W3Schools, YouTube tutorials and GeeksforGeeks helped me so much. Lastly, don’t bring yourself down when you’re struggling. You’ve come so far; you can and will accomplish many great things!

Pleuni: What’s your dog’s name and will it come with you to Boston?

Berenice: His name is Bowie and he’ll be staying with my family here in California. 

Pleuni: Final question. Python or R?

Berenice: I like Python, mostly because it’s the one I use the most. 

Pleuni: Thank you, Berenice! Please stay in touch!

SFSU bio and chem Master’s students do machine learning and scicomm

20 May

This semester (spring 2021) I taught a new class together with my colleagues Dax Ovid and Rori Rohlfs: Exploratory Data Science for Scientists. This class is part of our new GOLD program through which Master’s students can earn a certificate in Data Science for Biology and Chemistry (link). We were happily surprised when 38 students signed up for the class! 

In the last few weeks of the class I taught some machine learning, and as their final project, students had to find their own images to do image classification with a convolutional neural network. Then they had to communicate their science to a wide audience through blogs, videos or Twitter. Here are the results! I am very proud 🙂

If you are interested in the materials we used, let me know.

Videos

Two teams made videos about their final project: 

Anjum Gujral, Jan Mikhale Cajulao, Carlos Guzman and Cillian Variot classified flowers and trees. 

Ryan Acbay, Xavier Plasencia, Ramon Rodriguez and Amanda Verzosa looked at Asian and African elephants. 

Twitter 

Three teams decided to use Twitter to share their results. 

Jacob Gorneau, Pooneh Kalhori, Ariana Nagainis, Natassja Punak and Rachel Quock looked at male and female moths. 

Joshua Vargas Luna, Tatiana Marrone, Roberto (Jose) Rodrigues and Ale (Patricia) Castruita and Dacia Flores classified sand dollars. 

Jessica Magana, Casey Mitchell and Zachary Pope found cats and dogs. 

Blogs

Finally, four teams wrote blogs about their projects:

Adrian Barrera-Velasquez, Rudolph Cheong, Huy Do and Joel Martinez studied bagels and donuts. 

Jeremiah Ets-Hokin, Carmen Le, Saul Gamboa Peinada and Rebecca Salcedo were excited about dogs! 

Teagan Bullock, Joaquin Magana, Austin Sanchez and Michael Ward worked with memes. 

Musette Caldera, Lorenzo Mena and Ana Rodriguez Vega classified trees and flowers. 

https://arodri393.wixsite.com/labsite/post/demystifying-machine-learning

Using a Convolutional Neural Net to differentiate Bagels from Donuts

16 May

Article by: Adrian Barrera-Velasquez, Rudy Cheong, Joel Martinez, Huy Do

Why Bagels and Donuts?

Our group was initially torn on what to use for our classification assignment, but we ended up deciding we wanted to do something fun outside of the usual science data/image sets, given that we had all been working with those all semester. The initial suggestion was McDonald’s vs Burger King’s chicken nuggets, but that seemed like it wouldn’t work too well. Keeping with the food theme, however, we decided on donuts vs bagels, which is actually an interesting pair to compare. Morphologically, these two items are very similar, but as foods they are very different. We as humans can tell the difference between donuts and bagels pretty easily, so it was interesting to see whether that difference was enough for our neural net.

Nature of the Image Sets

As we mentioned, donuts and bagels are very similar in terms of morphology but have a very clear distinction when it comes to food. As such, they are presented differently, and we can see this even in our image set. We acquired our images by writing a Python script that automatically downloaded Google Image search results for donuts and bagels along with their links. From a cursory glance we can see that both items are usually displayed as multiples, but one of the biggest differences is that the donuts are more colorful. In addition, the bagels are oftentimes presented as sandwiches with things like cream cheese and smoked salmon. There is variety within each set of images, but we felt this made it more exciting to see how well the neural net would perform.
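For readers who want to build a similar image set, here is a minimal sketch of the download step. It is not our exact script: it assumes the image URLs have already been collected into plain text files (one URL per line), and the file and folder names are made up.

```python
# Minimal sketch: download a list of image URLs into a local folder.
# Assumes the URLs were already collected (e.g. from a Google Image search)
# and saved one per line in a text file.
import os
import requests

def download_images(url_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            with open(os.path.join(out_dir, f"image_{i:03d}.jpg"), "wb") as out:
                out.write(response.content)
        except requests.RequestException as err:
            print(f"Skipping {url}: {err}")

download_images("bagel_urls.txt", "images/bagels")  # hypothetical file names
download_images("donut_urls.txt", "images/donuts")
```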

What is VGG16?

Convolutional networks have made it easier than ever to conduct large-scale image and video recognition analysis. In particular, the VGG16 convolutional neural network has demonstrated superior recognition capabilities compared to other convolutional neural networks because of its network architecture. By using small 3 × 3 convolution filters in every layer, the overall depth of the network is increased, and this increase in depth is what ultimately allows VGG16 to achieve very high accuracy in classification and localization tasks.
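The pretrained VGG16 model (ImageNet weights) can be loaded in a couple of lines with Keras; here is a minimal sketch to load it and inspect its architecture (the exact loading code in our notebook may differ):

```python
# Load the pretrained VGG16 model (ImageNet weights) with Keras and print
# its architecture, including the stacks of small 3 x 3 convolution layers.
from tensorflow.keras.applications.vgg16 import VGG16

vgg16 = VGG16(weights="imagenet")  # weights are downloaded on first use
vgg16.summary()
```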

Results

The VGG16 neural network returned accurate results in classifying the labels of the 10 tested bagel images and 10 tested donut images. The proportion of images classified correctly was 1.0, indicating perfect accuracy. The confusion matrix illustrates this performance: zero bagel true labels were misclassified as donuts (bottom left quadrant), and zero donut true labels were misclassified as bagels (top right quadrant).

The compositions of the tested bagel images present a wide variance along parameters such as a single bagel versus an ensemble, varying profile angles, and the presence or absence of fillings or cream cheese spreads. Regardless of this variety, VGG16 predicted the true labels of the bagel images with perfect accuracy (bottom right quadrant). The following table shows the set of 10 tested bagel images:

The compositions of the tested donut images also present a wide variance along several parameters and VGG16 predicted the true labels of the donut images with perfect accuracy (top left quadrant). The following table shows the set of 10 tested donut images: 

It is interesting to note that VGG16 accurately labeled bagel and donut image pairs that lack any major salient features useful in classifying one image as clearly bagel and the other as clearly donut. Such a pair is shown here:

The ability to make such a distinction with a minimum of distinguishing features is indicative of the power of the VGG16 neural network for image classification.

Discussion

The neural net performed so well, in fact, that we were left wondering whether it had found a very simple method of classifying these images. As humans, we thought the color and toppings were an immediate dead giveaway, so the model may be relying on a color-space separation or some kind of edge density on the surface that captures texture. Unfortunately we cannot peer into the black box to see, but nonetheless this was a very satisfying project and result.

Retraining the VGG16 Neural Network for Meme Classification

16 May

Overview

The purpose of this project was to create and train a neural network model (copied from the VGG16 model created by Oxford scientists) to recognize and label images based on the “meme” category that they belong to. The two memes we chose were “Pacha” and “Doge”, though this code could be adapted to include any type of meme the user wishes to classify. This project was accomplished by adapting code provided by Pleuni Pennings, Ilmi Yoon, and Ana Caballero H. All code was written in Python and executed using Google Colaboratory.

Image Preparation

To begin, we collected 80 sample images from a simple Google Images search, including 40 of each meme type, and saved them in .pdf format to Google Drive. Images were then separated into training (n=40), validation (n=20), and testing (n=20) data sets by resizing the images and copying a subset of them to the respective folders. Once this image preparation step was complete, we had our data structured appropriately for input into the VGG16 model.
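A minimal sketch of this preparation step is shown below. It is not the exact course code: the folder names, the per-class split sizes (20/10/10 per meme, matching 40/20/20 overall), and the 224 × 224 target size expected by VGG16 are assumptions.

```python
# Sketch: resize each image and copy it into train/valid/test folders.
import os, random
from PIL import Image

def split_images(src_dir, dst_root, label, n_train=20, n_valid=10, n_test=10,
                 size=(224, 224)):
    files = sorted(os.listdir(src_dir))
    random.shuffle(files)
    splits = {"train": files[:n_train],
              "valid": files[n_train:n_train + n_valid],
              "test":  files[n_train + n_valid:n_train + n_valid + n_test]}
    for split, names in splits.items():
        out_dir = os.path.join(dst_root, split, label)
        os.makedirs(out_dir, exist_ok=True)
        for name in names:
            img = Image.open(os.path.join(src_dir, name)).convert("RGB")
            img.resize(size).save(
                os.path.join(out_dir, os.path.splitext(name)[0] + ".jpg"))

# Hypothetical source folders, one per meme class:
split_images("raw/pacha", "data", "pacha")
split_images("raw/doge", "data", "doge")
```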

Creating the Model

Our image data sets were imported from Google Drive into the model provided by Dr. Ilmi Yoon, and all images were plotted with labels using the numpy and pyplot libraries. We began with a pretrained VGG16 (Oxford) model, which was downloaded using Keras. This model was copied directly (excluding the final output layer) to create a new model for our purposes, in which all layers other than the final output layer were set to be non-trainable. The final output layer of the neural network was then replaced with a new output layer specified for our unique meme classification task, containing two outputs corresponding to the two possible data classes. The output of our model is a probability distribution for each image across these two classes, representing the probability that the image belongs to either class.
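In Keras, this copy-and-replace step can be written in a few lines. The sketch below follows the description above but is not necessarily identical to the course notebook:

```python
# Copy every layer of the pretrained VGG16 except its 1000-class output layer,
# freeze those layers, and add a new 2-unit softmax output for the two classes.
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

vgg16 = VGG16(weights="imagenet")

model = Sequential()
for layer in vgg16.layers[:-1]:    # all layers except the final output layer
    model.add(layer)
for layer in model.layers:
    layer.trainable = False        # only the new output layer will be trained
model.add(Dense(units=2, activation="softmax"))
```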

Training the Model

The images within the training data set were passed individually through this model, a value of the cost function was calculated for each output, and the back-propagation algorithm was then implemented to minimize this cost. This was done by adjusting the weights of the connections between the final hidden layer and the output layer, the only layer that we previously specified to be “trainable” in creating our model. This process was repeated by cycling through the entire training data set ten times, i.e., ten training epochs. To assess whether this number of training epochs was optimal to achieve the ideal balance between specificity and generality, while neither underfitting nor overfitting the model to the data, the cost of the training data set was compared to that of the validation data set.
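A minimal sketch of this training step, assuming train_batches and valid_batches are directory iterators built from the training and validation folders (for example with Keras’s ImageDataGenerator.flow_from_directory); the optimizer and learning rate are assumptions:

```python
# Compile the model and train for ten epochs, tracking the loss on both the
# training set and the validation set.
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_batches,                  # assumed training iterator
                    validation_data=valid_batches,  # assumed validation iterator
                    epochs=10,
                    verbose=2)
```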

Testing the Model

Finally, our model was tested by running the “testing” data set, containing images excluded from the training and validation data sets, through the model. The prediction accuracy of our model was calculated as the percentage of “testing” images classified correctly within the meme category that they truly belong to. Plotting these results using a confusion matrix revealed that our model was able to classify the images and distinguish between the two meme categories with 100% accuracy.
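A sketch of this testing step, assuming test_batches is a directory iterator created with shuffle=False so that its true labels line up with the predictions:

```python
# Predict labels for the held-out test images and build a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

predictions = model.predict(test_batches)           # class probabilities
predicted_labels = np.argmax(predictions, axis=-1)  # most probable class per image
cm = confusion_matrix(y_true=test_batches.classes, y_pred=predicted_labels)
accuracy = np.trace(cm) / cm.sum()                  # fraction on the diagonal
print(cm)
print(f"Test accuracy: {accuracy:.0%}")
```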

This project was completed as our final for the Exploratory Data Science for Scientists (EDSS) course, a component of the SFSU Graduate Opportunities to Learn Data Science (GOLD) program.

EDSS Team #1 – Joaquín Magaña, Michael Ward, Teagan Bullock, Austin Sanchez


How We Used Machine Learning to Classify Images of Dogs and Applications in Biology Research

16 May

By Rebecca Salcedo (rebeccasophiasalcedo@gmail.com), Carmen Le, Jeremiah Ets-Hokin, and Saul Gamboa

What can we do with Machine Learning?

Machine learning with images is a very powerful tool that can aid in analyzing large data sets. In biology, machine learning has many different applications.

Examples of Applying Machine Learning 

Identifying mutant embryos

One application is the ability to identify mutant embryos. Here are two images of frog embryos, one with a mutation in muscle development and one that is “normal”. (Photos courtesy of Dr. Julio Ramirez from Dr. Carmen Domingo’s group at San Francisco State University: https://biology.sfsu.edu/domingo-lab). You can see how similar these images are and how having to classify them manually would be both difficult and time consuming! This is a perfect place where machine learning and image classification become a great asset to developmental biology research.

Mutant:

Wildtype:

Coastal ecology/Marine biology research

Machine learning can also be useful in the world of marine ecology. One example is quantifying the abundance of different organisms underwater. Traditionally this would be done by SCUBA diving with a clipboard, measuring tape, and some sort of quadrat. You would individually count the abundance of whatever you are interested in and use that measurement as a subsample of the larger area. The two major limitations of this method are, one, that there is a lot of room for human error, and two, that SCUBA diving is limited to a very narrow window of time and hand surveying takes a lot of time. A method that many researchers are moving to is taking images of underwater areas. The limitation there is that they then end up with huge amounts of images that need to be analyzed. This is where machine learning comes in. You can write code that goes through huge numbers of images and classifies types of organisms. This method will allow for larger and higher-resolution data sets for an underwater world that is so hard to see.

Our project 

As a group we wanted to test out how accurate machine learning with images can be. For this project we decided to see if the machine could identify whether a dog exists in an image. We noticed that the Google Photos cloud also has image recognition. However, it makes some errors: there are times when a sea lion or an alligator is returned when searching for images of dogs. So we wanted to see if we could code an even more accurate image recognizer. Here are some of the images, with and without dogs, that we wanted to see if a computer could tell the difference between.

Images with Dogs:



Images without dogs:

About the code

To be able to recognize images of dogs, we could utilize code that already existed. For our project we decided to use Oxford’s machine learning code, a convolutional neural network called VGG16 that can perform image recognition. You can read about it here.

We used VGG16 as a starting point and trained it to recognize the images associated with our project. We had help from professors at San Francisco State University, Ilmi Yoon and Pleuni Pennings, who assisted in adjusting the code to be able to obtain our image dataset, output the results, and analyze the learning accuracy of the training done with VGG16. We also had some help from Twitter user @Ana_Caballero_H, thanks to a blog post showing how to split the images into multiple folders.

Our changes to the code

Even though an image recognition model already existed, we couldn’t use it directly without some minor modifications to both the code and our data. There were two main steps to prep our data. First was some data wrangling to train the machine with our existing images. In order to train the machine, you have to split your images into three different groups: training, test, and validation. Training images provide examples, and the test and validation images allow the model to test its skills, see if it’s right, and fine-tune its decision making.

While working to split our images, we encountered a number of errors. Eventually we realized that all of the images had to be in the same format. Even though all the images we had were from our cell phones, they were in a variety of different formats (jpeg, png, heic, and more). To fix this we made sure to export the images in the same format. Eventually, we were able to successfully split our images!
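Here is a minimal sketch of that fix (not our exact code; the folder names are made up): convert every photo to a single format before splitting. Note that HEIC photos from phones need an extra plugin, such as the pillow-heif package, before Pillow can open them.

```python
# Convert all photos in a folder to JPEG so the splitting code sees one format.
import os
from PIL import Image

src_dir, dst_dir = "raw_photos", "jpeg_photos"  # hypothetical folder names
os.makedirs(dst_dir, exist_ok=True)
for name in os.listdir(src_dir):
    try:
        img = Image.open(os.path.join(src_dir, name)).convert("RGB")
    except OSError as err:
        print(f"Could not open {name}: {err}")  # e.g. HEIC without the plugin
        continue
    img.save(os.path.join(dst_dir, os.path.splitext(name)[0] + ".jpg"), "JPEG")
```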

Additional changes to the code were made. We had to make sure that the two options it was deciding between were labeled “dogs” and “no dogs”. Importantly, these labels have to match the names of the folders we used to initially group the images the model was trained on.
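A sketch of what that looks like when the image batches are created with Keras’s ImageDataGenerator; the directory path and the exact folder names (“dogs” and “no_dogs”) are assumptions:

```python
# The `classes` argument must match the folder names under the training directory.
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_batches = ImageDataGenerator(
    preprocessing_function=preprocess_input).flow_from_directory(
        directory="data/train",        # hypothetical path
        target_size=(224, 224),        # VGG16 input size
        classes=["dogs", "no_dogs"],   # must match the folder names
        batch_size=10)
```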

 

Once all of that was done, our model was ready to go! But does it work?

Is our Machine ‘Learning’?

Our results suggest that our model is overfitting our data, i.e., it is too specific!

After the code was optimized and debugged, the pre-trained VGG16 model was imported in order to create a new model where all layers (except the output layer) were copied from the VGG16 model. This new model was trained to distinguish two conditions, i.e. “Dogs” and “No Dogs.” Next, the model was compiled in order to assess its performance; for this, we used “categorical_crossentropy” as the “loss” parameter. In this model, after the training data is evaluated, the model is adjusted to minimize the loss. We indicated 10 epochs for the number of times to go through the whole training set, and after running the model, we compared validation loss to training loss.
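That comparison can be read straight off the history object returned by model.fit. Here is a sketch of the check (assuming the model was trained as described above and the fit result was stored in a variable called history):

```python
# Plot training loss and validation loss per epoch; a validation loss that
# stays well above the training loss suggests overfitting.
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("categorical cross-entropy loss")
plt.legend()
plt.show()
```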

Our results indicate a much higher validation loss than training loss, which is a clear sign that our model is overfitting. Next, we completed a final test of the model, in which the percentage of images classified correctly was determined and a confusion matrix was produced. Our accuracy was 100%, and this number was supported by the confusion matrix, which shows that of the 30 test images, 17 were classified as true positives (Dogs) and the rest as true negatives (No Dogs).

Lastly, we completed our project by visualizing the true and predicted labels of our images to visually assess how we did.

And that’s how we made our model! If you’d like to take a look at our code or the images we used, check out our repository on GitHub.

This was all done as part of the GOLD program at SFSU, a wonderful program that provides graduate students the chance to develop their coding and data science skills.