Research Projects
Our work focuses on the application of computational techniques to solve
problems in biology. We also collaborate directly with biologists to help
them solve real-life biology problems.
Multiple Sequence Alignment
The focus is on investigating alternative formulations of
multiple sequence alignment and on utilizing additional sequences from
biological databases to improve alignment performance. Dissatisfied with the
NP-hardness of all previous formulations, we have proposed a polynomial
time solvable formulation of multiple alignment that achieves comparable
performance with the best heuristics. This result
challenges the common conception that the multiple alignment problem is hard
and provides significant theoretical advances. A second project is based on
the observation that alignment of divergent sequences can be improved by
searching for and incorporating information in additional sequences from
database search. Instead of using hits directly from database search,
we have investigated the use of intermediate sequences to
remove excessive noise from the hits and achieved significant improvements
in performance over current state-of-the-art approaches. The most recent
project makes better use of horizontal information by taking into account
alignment of neighboring residues when aligning two residues.
Papers
- Lu Y. and Sze S.-H. (2009)
Improving accuracy of multiple sequence alignment algorithms based on alignment
of neighboring residues. Nucleic Acids Research, 37,
463-472.
- Lu Y. and Sze S.-H. (2008)
Multiple
sequence alignment based on profile
alignment of intermediate sequences.
Journal of Computational Biology, 15, 767-777.
(Also appear in Proceedings of the 11th Annual
International Conference on Research in Computational Molecular Biology
(RECOMB'2007), Lecture Notes in Bioinformatics, 4453, 283-295.)
- Sze S.-H., Lu Y. and Yang Q. (2006)
A
polynomial time solvable formulation
of multiple sequence alignment. Journal of Computational Biology,
13, 309-319. (Also appear in Proceedings of the 9th Annual
International Conference on Research in Computational Molecular Biology
(RECOMB'2005), Lecture Notes in Bioinformatics, 3500, 204-216.)
Software
- ISPAlign - Intermediate sequence profile alignment
- NRAlign - Neighboring residue alignment
- PSAlign - Preserving multiple sequence alignment
Motif Finding with Applications to Predicting Transcription Factor Binding
Sites
The focus is on developing new models and formulations to improve motif
finding performance. We have developed an algorithm that finds DNA motifs
by skipping nonconserved positions, an algorithm that integrates
sample-driven and pattern-driven approaches to significantly
improve computation time, and an improved pattern-driven algorithm that
achieves very good biological performance.
Papers
- Zhao X. and Sze S.-H. (2011)
Motif
finding in DNA sequences based on skipping nonconserved positions in
background Markov chains.
Journal of Computational Biology, 18, 759-770.
- Sze S.-H. and Zhao X. (2006)
Improved pattern-driven algorithms for motif
finding in DNA sequences. Proceedings of the 2005 Joint RECOMB
Satellite Workshops on Systems Biology and Regulatory Genomics, Lecture Notes
in Bioinformatics, 4023, 198-211.
- Sze S.-H., Lu S. and Chen J. (2004)
Integrating
sample-driven and
pattern-driven approaches in motif finding. Proceedings of the 4th
Workshop on Algorithms in Bioinformatics (WABI'2004), Lecture Notes in
Bioinformatics, 3240, 438-449.
- Sze S.-H., Gelfand M.S. and Pevzner P.A. (2002)
Finding weak motifs in DNA
sequences. Pacific Symposium on Biocomputing (PSB'2002), 235-246.
- Pevzner P.A. and Sze S.-H. (2000)
Combinatorial approaches to finding subtle
signals in DNA sequences. Proceedings of the 8th International
Conference on Intelligent Systems for Molecular Biology (ISMB'2000),
269-278.
Software
- PosMotif - Finding DNA motifs based on skipping
positions in background Markov chains
- MotifEnumerator - Improved pattern-driven
algorithms for motif finding in DNA sequences
Biological Network Analysis
The focus is on developing efficient algorithms to extract
functional substructures from biological networks and on identifying related
subnetworks from a given network. We have considered the problem of
identifying complexes from a protein interaction network by considering
different types of complexes separately.
We have also considered the problem of identifying linear biological
pathways within a network and developed a new divide-and-conquer technique to
obtain improved probabilistic and deterministic algorithms with a much lower
time complexity than previous approaches.
In another direction, we have investigated the problem of finding a path within
a given graph that is most similar to another given path, with applications to
identifying conserved pathways in different organisms.
Papers
- Fan J.-H., Chen J. and Sze S.-H. (2012)
Identifying
complexes from protein interaction networks according to different types of
neighborhood density.
Journal of Computational Biology, 19, 1284-1294.
- Chen J., Kneis J., Lu S., Mölle D., Richter S., Rossmanith P.,
Sze S.-H. and Zhang F. (2009)
Randomized
divide-and-conquer: improved path, matching, and packing algorithms.
SIAM Journal on Computing, 38, 2526-2547.
- Qian X., Sze S.-H. and Yoon B.-J. (2009)
Querying
pathways in protein interaction networks based on hidden Markov models.
Journal of Computational Biology, 16, 145-157.
(Also appear in Proceedings of the 4th Annual RECOMB Satellite Workshop on
Systems Biology.)
- Lu S., Zhang F., Chen J. and Sze S.-H. (2007)
Finding pathway structures in
protein interaction networks. Algorithmica, 48, 363-374.
- Yang Q. and Sze S.-H. (2007)
Path
matching and graph matching in biological
networks. Journal of Computational Biology, 14, 56-67.
Software
- NDComplex - Identifying complexes from a protein
interaction network by considering different types of complexes separately
- PathMatch and GraphMatch
- Path matching and graph matching in biological networks
Identification of Gene Clusters
The focus is on investigating improved formulations to better model gene
clusters. We have developed a supervised protein family classification
algorithm that overcomes the problems of existing supervised and unsupervised
algorithms, an algorithm that finds conserved gene clusters
on multiple genomes, an algorithm that allows large-scale analysis
of gene clustering in bacteria, and an algorithm that extracts statistically
significant gene clusters from a genome with the help of gene ontology
annotations.
Papers
- Yi G., Thon M.R. and Sze S.-H. (2012)
Supervised
protein family classification and new family construction.
Journal of Computational Biology, 19, 957-967.
- Yang Q., Yi G., Zhang F., Thon M.R. and Sze S.-H. (2010)
Identifying
gene clusters within localized regions in multiple genomes.
Journal of Computational Biology, 17, 657-668.
- Yang Q. and Sze S.-H. (2008)
Large-scale
analysis of gene clustering in bacteria. Genome Research,
18, 949-956.
- Yi G., Sze S.-H. and Thon M.R. (2007)
Identifying clusters
of functionally related genes in genomes. Bioinformatics,
23, 1053-1060.
Software
- SClassify - Supervised protein family classification
- GCFinder - Finding conserved gene clusters on multiple
genomes
- GCQuery - Finding related gene clusters on a given
genome from a given query cluster
Collaborations with Biologists
Papers
- Sze S.-H., Dunham J.P., Carey B., Chang P.L., Li F., Edman R.M.,
Fjeldsted C., Scott M.J., Nuzhdin S.V. and Tarone A.M. (2012)
A
de novo transcriptome assembly of Lucilia sericata
(Diptera: Calliphoridae) with predicted alternative splices, single nucleotide
polymorphisms, and transcript expression estimates.
Insect Molecular Biology, 21, 205-221.
- Madina B.R., Kuppan G., Vashisht A.A., Liang Y.-H., Downey K.M.,
Wohlschlegel J.A., Ji X., Sze S.-H., Sacchettini J.C., Read L.K.
and Cruz-Reyes J. (2011)
Guide RNA
biogenesis involves a novel RNase III family endoribonuclease in
Trypanosoma brucei.
RNA, 17, 1821-1830.
- Zhu H., Hu F., Wang R., Zhou X., Sze S.-H., Liou L.W., Barefoot A.,
Dickman M. and Zhang X. (2011)
Arabidopsis
Argonaute10 specifically sequesters miR166/165 to regulate shoot apical
meristem development.
Cell, 145, 242-256.
- Lee J.J., Hassan O.S.S., Gao W., Wei N.E., Kohel R.J., Chen X.-Y.,
Payton P., Sze S.-H., Stelly D.M. and Chen Z.J. (2006)
Developmental
and gene
expression analyses of a cotton naked seed mutant. Planta,
223, 418-432.
- Wang N., Lu S.-E., Yang Q., Sze S.-H. and Gross D.C. (2006)
Identification
of the syr-syp box in the promoter regions of genes dedicated to
syringomycin and syringopeptin production by Pseudomonas syringae pv.
syringae B301D. Journal of Bacteriology, 188,
160-168.
- Yang S.S., Cheung F., Lee J.J., Ha M., Wei N.E., Sze S.-H., Stelly D.M.,
Thaxton P., Triplett B., Town C.D. and Chen Z.J. (2006)
Accumulation of
genome-specific transcripts, transcription factors and phytohormonal regulators
during early stages of fiber cell development in allotetraploid cotton.
The Plant Journal, 47, 761-775.
- Wang J., Adelson D.L., Yilmaz, A., Sze S.-H., Jin Y. and Zhu J.J. (2005)
Genomic organization,
annotation, and ligand-receptor inferences of chicken
chemokines and chemokine receptor genes based on comparative genomics.
BMC Genomics, 6(45).
- Lunyak V.V., Burgess R., Prefontaine G.G., Nelson C., Sze S.-H.,
Chenoweth J., Schwartz P., Pevzner P.A., Glass C., Mandel G. and
Rosenfeld M.G. (2002)
Corepressor-dependent
silencing of chromosomal regions
encoding neuronal genes. Science, 298, 1747-1752.