Sequential integration of fuzzy clustering and expectation maximization for transcription factor binding site identification

Ali Yousefian-Jazi, Jinwook Choi

Research output: Contribution to journal › Article › Research › peer-review

Abstract

The identification of transcription factor binding sites (TFBSs) is a problem for which computational methods offer great hope. The expectation maximization (EM) technique has been used successfully to find TFBSs in DNA sequences, but inappropriate initialization of EM yields poor performance or poor running-time scalability on a given data set. In this study, we integrate the fuzzy C-means and EM approaches to DNA motif discovery using a sequential integration approach, which defines the final solution as the set of solutions obtained by solving the objectives in a cascade. The new method is explained in detail and tested on chromatin immunoprecipitation sequencing (ChIP-seq) data sets for different transcription factors (TFs) with various motif patterns. The proposed algorithm also provides an efficient process for analyzing the similarity of discovered motifs to known motifs as well as for finding a target motif. A comparison with the well-known motif-finding tool MEME-ChIP shows the advantages of the proposed framework. Experimental results show that we were able to find the true motifs for all TFs, and that the motifs found by our algorithm were more similar to the known JASPAR motifs for the STAT1, GATA1, and JUN TFs than those found by MEME-ChIP.
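The cascade described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the sequence encoding, the distance measure, how a cluster is chosen as the EM starting point, or the EM sequence model, so everything below is an assumption made for illustration: k-mers are one-hot encoded, standard fuzzy C-means with Euclidean distance is used, the cluster center with the highest information content seeds EM, and EM uses a one-occurrence-per-sequence model with a uniform background. The names `one_hot`, `fuzzy_c_means`, `em_oops`, `discover_motif`, and `consensus` are hypothetical helpers, not the authors' implementation.

```python
import numpy as np

BASES = "ACGT"

def one_hot(kmer):
    """Encode a k-mer as a flat one-hot vector of length 4 * len(kmer)."""
    m = np.zeros((len(kmer), 4))
    m[np.arange(len(kmer)), [BASES.index(b) for b in kmer]] = 1.0
    return m.ravel()

def fuzzy_c_means(X, c, m=2.0, iters=50, seed=0):
    """Standard fuzzy C-means. Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))  # (n, c), each row sums to 1
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance of every point to every center (epsilon avoids 0)
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-9
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

def em_oops(seqs, pwm, iters=30, pseudo=0.1):
    """EM refinement assuming one motif occurrence per sequence (assumption)."""
    w = pwm.shape[0]
    cols = np.arange(w)
    encoded = [np.array([BASES.index(b) for b in s]) for s in seqs]
    for _ in range(iters):
        counts = np.full((w, 4), pseudo)
        for s in encoded:
            n_pos = len(s) - w + 1
            windows = np.stack([s[j:j + w] for j in range(n_pos)])
            # E-step: posterior over start positions, uniform 0.25 background
            lik = np.prod(pwm[cols, windows] / 0.25, axis=1)
            z = lik / lik.sum()
            # M-step: accumulate expected base counts per motif column
            for j in range(n_pos):
                counts[cols, windows[j]] += z[j]
        pwm = counts / counts.sum(axis=1, keepdims=True)
    return pwm

def discover_motif(seqs, w, c=3):
    """Stage 1: fuzzy C-means on all k-mers. Stage 2: EM seeded by one center."""
    kmers = [s[j:j + w] for s in seqs for j in range(len(s) - w + 1)]
    X = np.array([one_hot(k) for k in kmers])
    centers, _ = fuzzy_c_means(X, c)
    # Each center is a weighted average of one-hot rows, so it reshapes
    # into a (w, 4) frequency matrix; smooth it to keep entries positive.
    pwms = centers.reshape(c, w, 4) + 1e-3
    pwms = pwms / pwms.sum(axis=2, keepdims=True)
    # Assumed selection rule: seed EM with the most informative center
    ic = (pwms * np.log2(pwms / 0.25)).sum(axis=(1, 2))
    return em_oops(seqs, pwms[np.argmax(ic)])

def consensus(pwm):
    """Most probable base at each motif position."""
    return "".join(BASES[i] for i in pwm.argmax(axis=1))
```

A call such as `consensus(discover_motif(sequences, w=7))` returns the consensus string of the refined matrix for a list of uppercase ACGT strings. The point of the cascade in this sketch is that EM starts from a data-driven matrix instead of a random one; whether a given seed leads EM to the true motif on real ChIP-seq data depends on choices the abstract leaves open.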

Original language: English
Pages (from-to): 1247-1256
Number of pages: 10
Journal: Journal of Computational Biology
Volume: 25
Issue number: 11
DOI: 10.1089/cmb.2017.0230
State: Published - 1 Nov 2018

Fingerprint

Transcription factors
Expectation Maximization
Fuzzy clustering
Binding sites
Cluster Analysis
GATA1 Transcription Factor
Chip
Nucleotide Motifs
Chromatin Immunoprecipitation
Motif Discovery
Fuzzy C-means
Chromatin
DNA sequences
Computational methods
Initialization

Keywords

  • chromatin immunoprecipitation sequencing
  • expectation maximization
  • fuzzy C-means
  • motif discovery

Cite this

@article{6aeb1d07b276409aa636d12d61a0d755,
title = "Sequential integration of fuzzy clustering and expectation maximization for transcription factor binding site identification",
abstract = "The identification of transcription factor binding sites (TFBSs) is a problem for which computational methods offer great hope. Thus far, the expectation maximization (EM) technique has been successfully utilized in finding TFBSs in DNA sequences, but inappropriate initialization of EM has yielded poor performance or running time scalability under a given data set. In this study, we used a sequential integration approach that defined the final solution as the set of solutions acquired from solving objectives in a cascade manner to integrate the fuzzy C-means and the EM approaches to DNA motif discovery. The new method is explained in detail and tested on the chromatin immunoprecipitation sequencing (ChIP-seq) data sets for different transcription factors (TFs) with various motif patterns. The proposed algorithm also suggests an efficient process for analyzing motif similarity to known motifs as well as finding a target motif. A comparison of results with those of the well-known motif-finding tool, MEME-ChIP, shows the advantages of our proposed framework over this existing tool. Experimental results show that we were able to find the true motifs for all TFs, and that the motifs found by our proposed algorithm were more similar to JASPAR-known motifs for the STAT1, GATA1, and JUN TFs than those found by MEME-ChIP.",
keywords = "chromatin immunoprecipitation sequencing, expectation maximization, fuzzy C-means, motif discovery",
author = "Ali Yousefian-Jazi and Jinwook Choi",
year = "2018",
month = "11",
day = "1",
doi = "10.1089/cmb.2017.0230",
language = "English",
volume = "25",
pages = "1247--1256",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "11",
}

Sequential integration of fuzzy clustering and expectation maximization for transcription factor binding site identification. / Yousefian-Jazi, Ali; Choi, Jinwook.

In: Journal of Computational Biology, Vol. 25, No. 11, 01.11.2018, p. 1247-1256.

TY - JOUR

T1 - Sequential integration of fuzzy clustering and expectation maximization for transcription factor binding site identification

AU - Yousefian-Jazi, Ali

AU - Choi, Jinwook

PY - 2018/11/1

Y1 - 2018/11/1

N2 - The identification of transcription factor binding sites (TFBSs) is a problem for which computational methods offer great hope. Thus far, the expectation maximization (EM) technique has been successfully utilized in finding TFBSs in DNA sequences, but inappropriate initialization of EM has yielded poor performance or running time scalability under a given data set. In this study, we used a sequential integration approach that defined the final solution as the set of solutions acquired from solving objectives in a cascade manner to integrate the fuzzy C-means and the EM approaches to DNA motif discovery. The new method is explained in detail and tested on the chromatin immunoprecipitation sequencing (ChIP-seq) data sets for different transcription factors (TFs) with various motif patterns. The proposed algorithm also suggests an efficient process for analyzing motif similarity to known motifs as well as finding a target motif. A comparison of results with those of the well-known motif-finding tool, MEME-ChIP, shows the advantages of our proposed framework over this existing tool. Experimental results show that we were able to find the true motifs for all TFs, and that the motifs found by our proposed algorithm were more similar to JASPAR-known motifs for the STAT1, GATA1, and JUN TFs than those found by MEME-ChIP.

KW - chromatin immunoprecipitation sequencing

KW - expectation maximization

KW - fuzzy C-means

KW - motif discovery

UR - http://www.scopus.com/inward/record.url?scp=85056330946&partnerID=8YFLogxK

U2 - 10.1089/cmb.2017.0230

DO - 10.1089/cmb.2017.0230

M3 - Article

VL - 25

SP - 1247

EP - 1256

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 11

ER -