Due Sept 26th
Find a biological database of your choice and download two closely related entries from it. How did you determine that the entries are closely related? What is the “distance” between the entries? How did you define and compute it? Put all the information on your website.
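One possible way to define the distance, assuming the two entries are sequences stored as plain strings, is the edit (Levenshtein) distance. The sketch below is only an illustration of that choice, not the required answer; the two toy sequences are made up, and other definitions (alignment scores, k-mer overlap) may suit your database better.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical toy entries that differ by a single substitution.
seq_a = "ACTGTTGA"
seq_b = "ACTGATGA"
print(edit_distance(seq_a, seq_b), normalized_distance(seq_a, seq_b))
```

Normalizing by the longer length makes distances comparable across entries of different sizes, which is one reasonable basis for calling two entries "closely related".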
Due Oct 3rd
Look through the papers from the RECOMB conference, years 2014-2015. Choose a paper and discuss it in your group. Write a one-paragraph summary of the paper (not the abstract) and a summary of your discussion: Do you like the problem and the approach? Would you like to present it to the rest of the class? If you don’t like the approach, would you like to try to do it better? Post the summary and your thoughts/commentary (if any) on your website.
In a biological context, feature selection means extracting the elements of a biological sequence (features) that contribute most substantially to some target class. For example, perhaps the presence of “ACTGT” at positions 931-935 of a chromosome means that the patient may develop some type of cancer. A 2016 paper in the Journal of Computational Biology used an interesting neural network architecture and model training strategy to select the most important features in the context of cis-regulatory elements in non-coding DNA sequences [1]. The authors claim that simpler, linear models are incapable of identifying the patterns necessary to predict the target class, so they use deep feedforward neural networks for classification. After analyzing the methods of this paper, we believe that the same techniques could be applied to a broader range of biological data sets than previously studied. We are uncertain whether the same regularized neural network technique (deep feature selection) will successfully translate to a broader feature generation model. However, if successful, this model should identify and select important features for a variety of classification tasks. The classification models we will survey can include any type of target class (the presence or absence of cancer, hair color, eye color) and even regression tasks such as predicted life span.
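To make the notion of a sequence "feature" concrete, here is a minimal sketch of how a DNA string might be turned into numeric features. It assumes one-hot encoding over the alphabet A, C, G, T, which is a common choice but not something the paper or this proposal prescribes; the function names and the toy sequence are hypothetical.

```python
from __future__ import annotations

ALPHABET = "ACGT"  # assumed alphabet; any other symbol (e.g. N) encodes as all zeros


def one_hot(sequence: str) -> list[int]:
    """Flatten a DNA string into four binary features per position."""
    features = []
    for base in sequence.upper():
        features.extend(1 if base == letter else 0 for letter in ALPHABET)
    return features


def has_motif_at(sequence: str, motif: str, start: int) -> bool:
    """True if motif occupies 1-based positions start..start+len(motif)-1."""
    begin = start - 1
    return sequence[begin:begin + len(motif)].upper() == motif.upper()


# Toy stand-in for the chromosome example: "ACTGT" at positions 3-7.
toy = "GGACTGTCC"
print(has_motif_at(toy, "ACTGT", 3))
print(len(one_hot(toy)))  # 4 features for each of the 9 positions
```

Under this encoding, a motif at fixed positions corresponds to a specific group of input features being active together, which is the kind of pattern a feature selection method would need to pick out.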
We will need to collect a large set of labeled biological data, develop a high-quality model that takes these biological sequences (or other biological data) and outputs the correct class, then perform the necessary modifications to transform it from a feedforward neural network into the deep feature selection model presented in the paper. At this point we are not certain whether an off-the-shelf neural network package is capable of the required modifications. Therefore, this project will be broken into two stages: first we will develop the high-quality neural network classifier, and then, as a stretch goal, we will convert it to a deep feature selection model as presented in the paper. We will be using Keras and TensorFlow through Python to develop the neural networks.
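As a rough illustration of what the second stage involves, the sketch below shows our understanding of the core idea: each input feature gets its own trainable weight in a one-to-one layer placed before the network, and a sparsity penalty pushes most of those weights to zero. This is an assumption-laden simplification in plain Python, not the paper's exact formulation and not Keras code; the elastic-net form, the parameter names `lam` and `alpha`, and the example weights are all hypothetical.

```python
def selection_layer(x, w):
    """One-to-one layer: feature i is scaled by its own weight w[i]."""
    return [xi * wi for xi, wi in zip(x, w)]


def elastic_net_penalty(w, lam=0.01, alpha=0.5):
    """A common elastic-net form: L1 encourages exact zeros, L2 keeps weights small."""
    l1 = sum(abs(wi) for wi in w)
    l2 = sum(wi * wi for wi in w)
    return lam * (alpha * l1 + (1 - alpha) * 0.5 * l2)


def selected_features(w, tol=1e-6):
    """Indices of features whose weight is non-zero, ranked by magnitude."""
    kept = [i for i, wi in enumerate(w) if abs(wi) > tol]
    return sorted(kept, key=lambda i: abs(w[i]), reverse=True)


# Hypothetical weights as they might look after training with the penalty.
w = [0.0, 0.8, 0.0, -0.3]
x = [1, 1, 0, 1]
print(selection_layer(x, w))  # scaled inputs passed on to the first hidden layer
print(selected_features(w))   # features the model kept
```

During training the penalty would be added to the classification loss, so the stage-one classifier could in principle stay unchanged apart from the extra layer in front of it. Whether an off-the-shelf package allows this remains the open question stated above.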
As our proposal suggests, we are considering applying the deep learning model described above to a number of biological databases. This survey aims to provide an evaluation of this model on these data sets and an analysis of the findings for each category of biological data.
|DNA (coding/non-coding)||Cancer, disease, any physical features, intelligence, etc.|
|Bacteria||Bacteria species presence in microbiome -> reported symptoms|
1. Yifeng Li, Chih-Yu Chen, and Wyeth W. Wasserman. Deep Feature Selection: Theory and Application to Identify Enhancers and Promoters. Journal of Computational Biology, Volume 23, Number 5, 2016. Mary Ann Liebert, Inc.