
Machine Learning model to predict protein interaction given only the amino acid sequence.

Data Science using RapidMiner and PyBioMed, visualized with Blender

Our team became “Secondhand SMOTE”

 

In VCU’s Intro to Data Science class, groups were given the problem of creating a model to predict protein interaction from a limited supervised training dataset.

We began the project halfway through the semester, having already learned about the Synthetic Minority Oversampling TEchnique (SMOTE) and how it can help when dealing with imbalanced data.

 
 
 

Visualizing the Problem

 

On the left, most examples are green, representing one of the four labels. The red, blue, and pink classes are far smaller, with just 22 pink points in our supervised training dataset.

The right shows the same data after we oversampled the pink, red, and blue minority classes using SMOTE. The colors are visibly much more balanced, which helps our model learn the attributes that define a given label.
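To make the oversampling step concrete, here is a minimal sketch of the core SMOTE idea in plain Python: pick a minority point, pick one of its nearest minority neighbors, and place a new synthetic point somewhere on the line between them. This is not the RapidMiner operator we used. The `smote` function, its parameters, and the toy `pink` points are all assumptions made up for illustration.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration; expects at least two minority points.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # how far to move from base toward the neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Toy 2-D stand-ins for a small minority class (made-up values).
pink = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(pink, n_new=6, k=2)
print(len(pink) + len(new_points), "pink points after oversampling")
```

Because every synthetic point lies between two real minority points, the new examples stay inside the region the minority class already occupies rather than being exact duplicates.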

 
 

Random Decision Forest

After trying many of the tools RapidMiner includes, such as Grid Optimization and a range of its built-in algorithms, our group found that the Random Decision Forest algorithm improved specificity for all labels.

Here we see decision trees with a max depth of 3, each voting by color, with the prediction decided by simple majority. Our actual model used far less pruning, but at that point things begin to look chaotic even with just one tree.
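The voting step can be sketched in a few lines of plain Python. This is only an illustration of how shallow trees vote by color; the trees below are hand-built with made-up features and thresholds, whereas our real forest was trained in RapidMiner. The names `predict_tree`, `predict_forest`, and `forest` are hypothetical.

```python
from collections import Counter

# A tree is either a leaf (a color label string) or a tuple:
# (feature_index, threshold, left_subtree, right_subtree).
def predict_tree(tree, x):
    """Walk one tree from the root down to a leaf and return its color."""
    while not isinstance(tree, str):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree

def predict_forest(trees, x):
    """Let every tree vote, then return the most common color and the tally."""
    votes = Counter(predict_tree(t, x) for t in trees)
    return votes.most_common(1)[0][0], votes

# Three hand-built shallow trees (made-up thresholds, for illustration only).
forest = [
    (0, 0.5, "green", (1, 0.3, "red", "blue")),
    (1, 0.3, (0, 0.5, "green", "red"), "blue"),
    (0, 0.95, "green", (1, 0.6, "red", "pink")),
]

label, votes = predict_forest(forest, (0.9, 0.2))
print(label, dict(votes))
```

For the sample point above, two trees vote red and one votes green, so the forest predicts red. Deeper trees work the same way, just with many more branches per vote.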

 