Disorder in protein structure and function has received little attention, even though the functional significance of disordered regions cannot be denied. The Protein Data Bank (PDB), which contains the largest collection of structural information on proteins, is also the main source of information on disordered proteins. However, because the database lacks organised annotations and collective data on disordered regions, the biological community at large remains unaware of the importance of disorder. We therefore propose a database that makes a systematic collection of disordered proteins from the PDB more accessible. We hope that it will raise awareness among biologists and encourage future studies of disorder in proteins.
The database has two aims:
a) to reduce the work biologists face in collecting/extracting potential candidates that might be disordered
b) to provide a means for biologists to retrieve/extract information on disordered proteins in a collective and global manner
Methods:
General search:
Database source: Protein Database of Disorder (created and categorised based on the keywords).
A list of selected keywords indicative of "disorder" is used to retrieve/extract entries from the PDB. The database generated depends on the keyword(s) the user chooses. Each keyword is linked to a separate database that describes the usages of that keyword across the entire PDB, which gives suggestions about its meaning in the context of "disorder". If more than one keyword is used for the search, the output includes every entry that contains at least one of the chosen keywords. In this manner, at the user's discretion, the entries that best meet the criteria (e.g. entries with the highest keyword scores) may be selected for experimental confirmation.
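The keyword scoring described above can be sketched as follows. This is a minimal illustration in Python, not the implementation behind the database: the headers mapping, the function names, and the use of simple case-insensitive substring matching are all assumptions made for the example. With substring matching, "disorder" also matches "disordered", and a text containing "flexible linker" also scores for "flexible" and "linker".

```python
# Keywords from the list in this document; underscores stand for spaces.
KEYWORDS = [
    "disorder", "gap", "missing", "poorly_ordered", "flexible_linker",
    "linker", "flexible", "unfolded", "molten_globule", "random_coil",
]


def keyword_hits(header_text, keywords):
    """Return the set of keywords found in one PDB header text."""
    # Normalise case and whitespace so multi-word keywords match across lines.
    text = " ".join(header_text.lower().split())
    return {kw for kw in keywords if kw.replace("_", " ") in text}


def general_search(headers, chosen):
    """Rank entries that contain at least one chosen keyword.

    headers maps an entry identifier to its header text (hypothetical layout).
    Returns (entry_id, score, matched_keywords) tuples, highest score first.
    """
    results = []
    for entry_id, text in headers.items():
        hits = keyword_hits(text, chosen)
        if hits:  # keep entries with at least one chosen keyword
            results.append((entry_id, len(hits), sorted(hits)))
    # Sort by descending score, then by identifier for a stable order.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results
```

Calling general_search(headers, KEYWORDS) would score every entry against the full keyword list; passing a shorter list restricts the score to the user's chosen keywords.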
Specific search:
Database source: Raw data from entire PDB header text.
Specific search, as the name suggests, is meant for users who are searching for detailed information on ANY protein that they are working on. The search method employed is the sequential text search system SIGMA [1], which enables very fine searching and editing of texts. Specific Search supports fine-grained AND/OR/NOT searches with keywords (any strings of symbols/codes), together with pattern searches such as KEYWORD1...KEYWORD2 (this finds a text that contains both KEYWORD1 and KEYWORD2, with KEYWORD1 occurring to the left of KEYWORD2), e.g. Box 1: CALMODULIN GAP FLEXIBLE, LINKER without Box 2: "WEAK...DENSITY". Searches of this kind give the user some indication of whether a protein is disordered or is a potential candidate. The search also leaves the decision entirely to the user, since a non-filtered source of data (the PDB) is searched, in contrast to the "pre-screened" data in General Search.
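The kind of query described above can be illustrated with a short Python sketch. It does not reproduce SIGMA or its query syntax; it uses Python regular expressions as a stand-in, and the function names and the all_of/any_of/none_of parameters are assumptions made for the example. The ordered pattern KEYWORD1...KEYWORD2 is translated into a regular expression that requires KEYWORD1 to occur before KEYWORD2.

```python
import re


def ordered_pattern(term):
    """Compile a plain term or 'KEYWORD1...KEYWORD2' into a regex.

    For an ordered pattern, the parts must occur left to right in the text.
    """
    parts = [re.escape(p.strip()) for p in term.split("...")]
    # DOTALL lets the text between the keywords span line breaks.
    return re.compile(".*?".join(parts), re.IGNORECASE | re.DOTALL)


def contains(text, term):
    """True if text contains the term (plain or ordered pattern)."""
    return ordered_pattern(term).search(text) is not None


def specific_search(text, all_of=(), any_of=(), none_of=()):
    """Combine AND (all_of), OR (any_of) and NOT (none_of) conditions."""
    if not all(contains(text, t) for t in all_of):
        return False
    if any_of and not any(contains(text, t) for t in any_of):
        return False
    return not any(contains(text, t) for t in none_of)
```

A query loosely modelled on the example above might be written as specific_search(header, all_of=["CALMODULIN", "GAP"], none_of=["WEAK...DENSITY"]); how the terms in Box 1 are actually combined by the system is not specified here, so this reading is only one possibility.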
Keywords:
- disorder
- gap
- missing
- poorly_ordered
- flexible_linker
- linker
- flexible
- unfolded
- molten_globule
- random_coil
Note:
Although regions whose coordinates are absent from the PDB X-ray data are likely to be disordered, certain regions (loops, C- and N-termini) may give vague electron densities for other reasons, and these also result in missing coordinates. Likewise, the presence of coordinates in certain ambiguous regions (which may be disordered) does not rule out disorder. The likelihood that an entry is disordered increases with its keyword score, although this is not absolute and may need experimental confirmation. Our proposed database is by no means conclusive, but it provides a systematic means for biologists to concentrate on the proteins in the PDB that have the highest likelihood of disorder.
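As a rough illustration of how absent coordinates might be spotted, the following Python sketch looks for jumps in the residue numbering of the ATOM records of a PDB file. It is an assumption-laden example rather than part of the proposed database: the function name is hypothetical, insertion codes are ignored, and a jump in numbering may simply reflect the numbering convention of the entry rather than missing residues, so any result would need checking against the header text.

```python
def numbering_gaps(pdb_lines):
    """Return (chain, last_seen, next_seen) for each jump in residue numbers.

    Reads the fixed-column ATOM records of the PDB format: the chain
    identifier is in column 22 and the residue sequence number is in
    columns 23-26 (1-based columns).
    """
    gaps = []
    last = {}  # chain identifier -> last residue number seen
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skip HETATM, REMARK and all other record types
        chain = line[21]
        resseq = int(line[22:26])
        prev = last.get(chain)
        # A jump of more than one residue suggests absent coordinates.
        if prev is not None and resseq > prev + 1:
            gaps.append((chain, prev, resseq))
        last[chain] = resseq
    return gaps
```

For a file opened with open(path), numbering_gaps(f) can be called on the file object directly, since it only iterates over lines.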
Future work:
Future work includes categorising the entries generated from ProDDO in a more systematic and detailed manner. This will be done by designing views using HypothesisCreator and finding common features (hydropathy, sequence complexity, etc.) in protein entries that share common "keywords". Our preliminary studies on knowledge extraction of disordered regions of proteins from the PDB have demonstrated the feasibility of this process [2].
The reasons for choosing each keyword are given below. They are meant to help users assign different weights to the keywords used in their searches.
disorder - The decisive keyword, which suggests true disorder in protein structures. "Dynamic" and "static" disorders are often reported in X-ray diffraction data [3]. Both types represent true disorder. However, the keyword is not an absolute indicator.
gap - Missing residues are reported as "Gap in PDB entry" in the header text.
missing - This keyword appears mainly in the context of missing residues or atoms in the database created here. Missing electron density in an X-ray-solved protein structure can be due to technical limitations or be the result of structural disorder [3,4].
poorly_ordered - Regions of a protein that have weak electron density are designated as poorly ordered. This may reflect disorder or ambiguity resulting from experimental limitations.
flexible_linker - The keyword "flexible linker" serves as a positive filter and is strongly related to disorder in proteins.
linker - This keyword carries the least weight as an indication of disorder in proteins. It is included in the search so as not to eliminate proteins that might have a tendency to disorder but are missed by the "flexible linker" keyword search.
flexible - The keyword "flexible" suggests "parts" or "regions" (loops, domains, residues, etc.) of protein structures that have this feature. This keyword is chosen to cover parts/regions of proteins that have no secondary structure and strongly suggest disorder.
unfolded - "Unfolded" implies that the region of the protein exists in an extended, flexible (random-coil-like) form. Such proteins have been found to have biological importance [5,6,7,8]. The terms natively unfolded [9] and natively disordered [10,11] have been suggested to represent disorder in proteins.
molten_globule - A sequence that does not fold into a single unique 3D structure under physiological conditions can adopt a partially folded form such as a molten globule [12].
random_coil - One of the forms that a natively unfolded sequence adopts. The disordered ensemble of structures can involve equilibria between random-coil-like and molten-globule-like forms.
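One way a user might apply different weights to the keywords described above is sketched below in Python. The numerical weights are purely illustrative assumptions: the text only indicates that "linker" carries the least weight and that "disorder" and "flexible linker" are strong indicators, so every number here is hypothetical and would need to be chosen by the user.

```python
# Illustrative weights only. The ordering loosely follows the reasons given
# above ("linker" lowest, "disorder" and "flexible_linker" highest); the
# numbers themselves are assumptions, not values from the database.
WEIGHTS = {
    "disorder": 3.0,
    "flexible_linker": 3.0,
    "unfolded": 2.0,
    "molten_globule": 2.0,
    "random_coil": 2.0,
    "poorly_ordered": 2.0,
    "flexible": 2.0,
    "missing": 1.0,
    "gap": 1.0,
    "linker": 0.5,
}


def weighted_score(matched_keywords, weights=WEIGHTS):
    """Sum the weights of the distinct keywords matched by one entry."""
    # set() ensures a keyword counts once however often it was matched;
    # keywords without a weight contribute nothing.
    return sum(weights.get(kw, 0.0) for kw in set(matched_keywords))
```

Ranking entries by such a score, rather than by a plain count of matched keywords, lets an entry matching only "linker" fall below one matching "flexible_linker".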
1. Arikawa, S., et al., "The text database management system SIGMA: an improvement of the main engine", Proc. Berliner Informatik-Tage, 72-81 (1989).
2. Maruyama, O., Uchida, T., Sim, K.L. & Miyano, S., "Designing Views in HypothesisCreator: System for Assisting in Discovery", Lecture Notes in Artificial Intelligence (Proc. First International Conference on Discovery Science), 1532: 105-116, Springer-Verlag (1998).
3. Bennett, W.S. & Huber, R. "Structural and Functional Aspects of Domain Motions in Proteins", CRC Critical Reviews on Biochemistry, 15(4): 291-369 (1984).
4. Fehlhammer, H., Bode, W., "The refined crystal structure of bovine β-trypsin at 1.8 Å resolution. I. Crystallisation, data collection and application of Patterson search technique," J. Mol. Biol., 98(4):683-692 (1975).
5. Daughdrill, G.W., Chadsey, M.S., Karlinsey, J.E., Hughes, K.T. & Dahlquist, F.W. The C-Terminal Half of the Anti-sigma Factor, FlgM, Becomes Structured When Bound to Its Target. Nature Struc. Biol. 4, 285-291 (1997).
6. Riek, R., Hornemann, S., Wider, G., Glockshuber, R. & Wüthrich, K. NMR Characterisation of the Full-length Recombinant Murine Prion Protein mPrP (23-231). FEBS Lett. 413:282-288 (1997).
7. Frankel, A.D. & Kim, P.S. Modular Structure of Transcription Factors: Implications for Gene Regulation. Cell. 65:717-719 (1991).
8. Huth, J.R., Bewley, C.A., Nissen, M.S., Evans, J.N.S., Reeves, R., Gronenborn, A.M. & Clore, G.M. The Solution Structure of an HMG-I(Y)-DNA Complex Defines a New Architectural Minor Groove Binding Motif. Nature Struc. Biol. 4:657-665 (1997).
9. Weinreb, P.H., Zhen, W., Poon, A.W., Conway, K.A. & Lansbury, Jr. P.T. NACP, a protein implicated in Alzheimer's disease and learning, is natively unfolded. Biochemistry. 35(43):13709-13715 (1996).
10. Dunker, A.K., Garner, E., Guilliot, S., Romero, P., Albrecht, K., Hart, J., Obradovic, Z., Kissinger, C. & Villafranca, J.E. Protein Disorder and the Evolution of Molecular Recognition: Theory, Predictions and Observations. Pacific Symposium on Biocomputing, 3:471-482 (1998).
11. Garner, E., Cannon, P., Romero, P., Obradovic, Z. & Dunker, A.K. Predicting Disordered Regions from Amino Acid Sequence: Common Themes Despite Differing Structural Characterisation. Genome Informatics, 9:201-214 (1998).
12. Dolgikh, D.A., Gilmanshin, R.I., Brazhnikov, E.V., Bychkova, V.E., Semisotnov, G.V., Yu, S. & Ptitsyn, O.B. α-Lactalbumin: Compact State with Fluctuating Tertiary Structure? FEBS Lett. 136:311-315 (1981).