Biological Sciences

The Ultimate Three-Dimensional Puzzle: The Protein Folding Problem


George A. Khoury, James Smadbeck, and Christodoulos A. Floudas 
Department of Chemical and Biological Engineering
Princeton University


Research in the area of protein folding aims to determine the 3-dimensional structure of a protein, given its amino acid sequence. Protein structure prediction is the inverse of protein design, in which one tries to find a sequence that will fold into a desired structure.

More than 23,000,000 protein sequences have been identified through genomic sequencing approaches, which aim to determine an organism’s entire DNA sequence. Structures have been experimentally determined for only ~84,000 of them (~0.4%). The aim of research in protein folding is to predict the structure of a protein with accuracy similar to that of an experimentally solved structure, without having to solve it experimentally each time. Moreover, accurate structures of the proteins involved in the mechanisms of cancer, HIV, diabetes, and other debilitating diseases are key to discovering treatments and, ultimately, cures. Determining all of these structures experimentally would be prohibitively slow and expensive. The ability to predict them accurately and reliably is therefore of great importance.


Experimental researchers can determine the actual structure of a protein using solution and solid-state nuclear magnetic resonance (NMR) or X-ray crystallography, but current methods take months to years to determine a single structure. In addition, such experimental methods frequently encounter challenges in solving the structures of membrane proteins and proteins containing non-naturally occurring amino acids.

Protein structure prediction uses extensive computational resources to explore the enormous number of possible structural conformations. The most accurate structure prediction methods can predict structures to sub-angstrom accuracy.


Our methods apply a hierarchical approach to protein structure prediction, with each stage achieving a different level of accuracy. The secondary structure prediction method CONCORD, which aims to determine which regions of the protein are alpha-helices, beta-strands, and loops, consistently achieves 80% accuracy in blind tests. In the second stage, our method BeST attempts to determine which beta-strands contact each other and whether they are oriented in a parallel or anti-parallel fashion, achieving over 78% precision and recall within the top 5 solutions containing at least 3 beta-strands. In the image, the beta-strands are shown as arrows, contacting in a parallel fashion.
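The accuracy figures quoted for the two stages can be made concrete with a small sketch. It is not the CONCORD or BeST code: it only illustrates, under stated assumptions, how per-residue accuracy and pairing precision/recall are commonly computed. The three-letter alphabet (H for helix, E for strand, C for loop), the tuple format for strand pairings, and all function names and example data are hypothetical, and only a single candidate solution is scored rather than the top 5.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state (H, E, or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def normalize_pairing(pairing):
    """Order the two strand indices so (2, 1, 'anti') and (1, 2, 'anti') compare equal."""
    i, j, orientation = pairing
    return (min(i, j), max(i, j), orientation)


def pairing_precision_recall(predicted_pairings, true_pairings):
    """Precision and recall of predicted strand pairings, orientation included."""
    predicted = {normalize_pairing(p) for p in predicted_pairings}
    true = {normalize_pairing(p) for p in true_pairings}
    hits = len(predicted & true)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(true) if true else 0.0
    return precision, recall


# Hypothetical example data, invented for illustration only.
observed_ss = "CCHHHHHCCEEEEECCEEEEECC"
predicted_ss = "CCHHHHCCCEEEEECCCEEEECC"

true_pairings = [(1, 2, "anti"), (2, 3, "anti"), (3, 4, "anti")]
predicted_pairings = [(2, 1, "anti"), (2, 3, "anti"), (1, 4, "parallel"), (3, 4, "parallel")]

accuracy = q3_accuracy(predicted_ss, observed_ss)
precision, recall = pairing_precision_recall(predicted_pairings, true_pairings)
print(f"secondary structure accuracy: {accuracy:.3f}")
print(f"pairing precision: {precision:.3f}, recall: {recall:.3f}")
```

In this sketch a pairing counts as correct only if both the strand pair and its orientation match, which is why the predicted pairing of strands 3 and 4 with the wrong orientation lowers both precision and recall.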

We participated in the 10th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP10), for which we predicted over 500 structures for over 50 proteins with currently unsolved structures, using our sequential protein structure prediction method ASTRO-FOLD. CASP is considered the “Olympics” of protein folding. We are now awaiting the most recent assessments. Our approach to solving the 3-D structure of a protein uses deterministic global optimization. Two years ago in CASP9, our most accurate prediction was for Target 580, which was ranked 3rd in the world based on distance (C-alpha RMSD) to the true structure.
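As an illustration of the distance measure mentioned above, C-alpha RMSD is the root-mean-square distance between corresponding alpha-carbon atoms of a predicted and a reference structure. The sketch below assumes the two coordinate sets are already superimposed and listed in the same residue order; in practice an optimal superposition (for example with the Kabsch algorithm) is computed first. The function name and the coordinates are hypothetical and are not CASP data.

```python
import math


def ca_rmsd(model_coords, reference_coords):
    """Root-mean-square deviation, in the units of the input (typically angstroms).

    Assumes both lists hold (x, y, z) tuples for matching C-alpha atoms and that
    the structures have already been superimposed.
    """
    if len(model_coords) != len(reference_coords):
        raise ValueError("both structures must have the same number of C-alpha atoms")
    if not model_coords:
        raise ValueError("cannot compute RMSD for empty coordinate lists")
    squared_sum = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model_coords, reference_coords):
        squared_sum += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared_sum / len(model_coords))


# Hypothetical three-residue example, invented for illustration only.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, -1.0)]

print(f"C-alpha RMSD: {ca_rmsd(model, reference):.3f}")
```

A lower value means the prediction lies closer to the experimental structure, and identical coordinate sets give an RMSD of zero.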

The protein folding problem is far from solved. Sequences with high similarity to proteins of known structure can use structural templates: known structures similar enough in sequence to act as a “template” fold for what the new protein sequence should fold into. Template-based methods rely on the thesis that “structure is more conserved than sequence.” New sequences with no similar counterparts in any of the databases remain a challenge, since the most accurate methods currently available rely on structural templates. Until one can consistently solve both the structures that have structural templates and those that do not, the protein folding problem will remain unsolved.
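One simple way to quantify how similar a new sequence is to a candidate template is percent sequence identity over an alignment. The sketch below assumes the two sequences have already been aligned, with "-" marking gaps. The function name and the example sequences are hypothetical, and real template searches use more sensitive scoring than plain identity.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percentage of identical residues among alignment columns that contain no gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == gap or t == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if q == t:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical pre-aligned fragments, invented for illustration only.
aligned_query = "MKT-AYIAKQR"
aligned_template = "MKTLAYLAK-R"

print(f"sequence identity: {percent_identity(aligned_query, aligned_template):.1f}%")
```

A sequence whose best identity against every known structure is very low falls into the template-free category described above.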
