|
Home
About Us
eMedicine Search
Drug Development
Feedback
Google Scholar Search
Intranet |
|
Enhanced by
Neuroinformation Computation and Biology (86 References) Polanski, A., A. Bobrowski, et al. (2003). "A note on distributions of times to coalescence, under time-dependent population size." Theor Popul Biol 63(1): 33-40. Expressions for marginal distributions of times in the time-varying coalescence process are derived. The proposed method allows also for computation of joint probability distribution for pairs, triples, etc. of coalescence times. The expressions derived are useful for (1) extending several statistics from time constant to time-varying case, (2) increasing efficiency and accuracy of simulations in time-varying evolution, and (3) debugging coalescence simulation software.
Jacoby, E., J. Davies, et al. (2003). "Design of small molecule libraries for NMR screening and other applications in drug discovery." Curr Top Med Chem 3(1): 11-23. There are conceptual differences between high-throughput screening (HTS) and fragment-based screening by NMR. The number of compounds in libraries for NMR screening may be significantly smaller than those used for HTS. Because one relies on a small library its design is significantly important and is the object of this article. A short introduction on fragment-based NMR screening approaches will be provided. Although there are currently very few reports describing the design of libraries of small molecules for NMR screening, aspects of the question of how to compile diverse collections of small molecular fragments useful for drug design were previously addressed for the purposes of combinatorial library design and de novo drug design. As these disciplines are highly interrelated and are applied in an interconnected manner with NMR screening within the drug discovery process, a review of combinatorial library design and especially the building block or fragment selection strategies applied for combinatorial library design and de novo design is well suited to reveal fundamental strategies and potential techniques for the design of NMR screening libraries. This section will be rounded off by a report on hands-on-experience with the design of the Novartis second-site NMR screening library and practical considerations for the design of compound mixtures. Rather than providing an exact protocol general guidelines will be indicated.
Hood, L. (2003). "Systems biology: integrating technology, biology, and computation." Mech Ageing Dev 124(1): 9-16. The Human Genome Project has changed the worlds of biology and medicine-helping to catalyze two major paradigm changes: systems biology and predictive, preventive and personalized medicine. These two themes will dominate 21st century biology and medicine. I will discuss these changes and indicate how they may interface with with the process of aging.
Haese, A., M. Chaudhari, et al. (2003). "Quantitative biopsy pathology for the prediction of pathologically organ-confined prostate carcinoma: a multiinstitutional validation study." Cancer 97(4): 969-78. BACKGROUND: Quantitative biopsy pathology with prostate specific antigen significantly improves the prediction of pathologic stage in patients with clinically localized prostate carcinoma (PCa). The authors recently reported a computational model for predicting patient specific likelihood of organ confinement of PCa using biopsy pathology and clinical data. The current study validates the initial models and presents an new, improved tool for clinical decision making. METHODS: The authors assessed 10 biopsy pathologic parameters and 2 clinical parameters using data from two institutions. Of 1287 patients, 798 men had pathologically organ confined (OC) PCa, 282 men had nonorgan-confined disease with capsular penetration (NOC-CP) only, and 207 men showed seminal vesicle or lymph node invasion (NOC-AD) after undergoing pelvic lymphadenectomy and radical prostatectomy. Patient input data were evaluated by ordinal logistic (OLOGIT) and neural network (NN) models; and the likelihood of developing OC, NOC-CP, or NOC-AD disease was calculated for the combined and separate data sets and was compared with the results from original presentation. In addition, a new two-output model was constructed (OC/NOC-CP vs. NOC-AD). RESULTS: The three-output OLOGIT and NN models predicted OC disease with 95.0% and 98.6% accuracy, respectively, for the combined data set and with 93.0% and 98.6% accuracy, respectively, on subset analysis. The combined accuracy for predicting OC, NOC-CP, and NOC-AD disease in the entire validation set was 66.7% for OLOGIT model and 66.0% for the NN model. The two-output OLOGIT and NN models correctly predicted 94.9% and 100.0% of all OC/NOC-CP disease, respectively. CONCLUSIONS: Both computation models predicted OC PCa with an accuracy of 93.0-98.6% when they were validated with two different data sets. The OLOGIT and NN-based, two-output model permitted an appropriate treatment decision for 85.2-90.2% of patients. These data support the use of quantitative pathology and clinical data-based decision modeling to manage patients with clinically localized PCa.
Frishman, D., M. Mokrejs, et al. (2003). "The PEDANT genome database." Nucleic Acids Res 31(1): 207-11. The PEDANT genome database (http://pedant.gsf.de) provides exhaustive automatic analysis of genomic sequences by a large variety of established bioinformatics tools through a comprehensive Web-based user interface. One hundred and seventy seven completely sequenced and unfinished genomes have been processed so far, including large eukaryotic genomes (mouse, human) published recently. In this contribution, we describe the current status of the PEDANT database and novel analytical features added to the PEDANT server in 2002. Those include: (i) integration with the BioRS data retrieval system which allows fast text queries, (ii) pre-computed sequence clusters in each complete genome, (iii) a comprehensive set of tools for genome comparison, including genome comparison tables and protein function prediction based on genomic context, and (iv) computation and visualization of protein-protein interaction (PPI) networks based on experimental data. The availability of functional and structural predictions for 650 000 genomic proteins in well organized form makes PEDANT a useful resource for both functional and structural genomics.
Zeigler, B. P. (2002). "The brain-machine disanalogy revisited." Biosystems 64(1-3): 127-40. Michael Conrad was a pioneer in investigating biological information processing. He believed that there are fundamental lessons to be learned from the structure and behavior of biological brains that we are far from understanding or have implemented in our computers. Accumulation of advances in several fields have confirmed his views in broad outline but not necessarily in some of the strong forms he had tried to establish. For example, his assertion that programmable computers are intrinsically incapable of the brain's efficient and adaptive behavior has not received much examination. Yet, this is clearly a direction that could afford much insight into fundamental differences between brain and machine. In this paper, we pay tribute to Michael, by examining his pioneering thoughts on the brain-machine disanalogy in some depth and from the hindsight of a decade later. We argue that as long as we stay within the frame of reference of classical computation, it is not possible to confirm that programmability places a fundamental limitation on computing power, although the resources required to implement a programmable interface leave fewer resources for actual problem-solving work. However, if we abandon the classical computational frame and adopt one in which the user interacts with the system (artificial or natural) in real time, it becomes easier to examine the key attributes that Michael believed place biological brains on a higher plane of capability than artificial ones. While we then see some of these positive distinctions confirmed (e.g. the limitations of symbol manipulation systems in addressing real-world perception problems), we also see attributes in which the implementation in bioware constrains the behavior of real brains. We conclude by discussing how new insights are emerging, that look at the time-bound problem-solving constraints under which organisms have had to survive and how their so-called 'fast and frugal' faculties are tuned to the environments that coevolved with them. These directions open new paths for a multifaceted understanding of what biological brains do and what we can learn from them. We close by suggesting how the discrete event modeling and simulation paradigm offers a suitable medium for exploring these paths.
Zeeberg, B. (2002). "Shannon information theoretic computation of synonymous codon usage biases in coding regions of human and mouse genomes." Genome Res 12(6): 944-55. Exonic GC of human mRNA reference sequences (RefSeqs), as well as A, C, G, and T in codon position 3 are linearly correlated with genomic GC. These observations utilize information from the completed human genome sequence and a large, high-quality set of human and mouse coding sequences, and are in accord with similar determinations published by others. A Shannon Information Theoretic measure of bias in synonymous codon usage was developed. When applied to either human or mouse RefSeqs, this measure is nonlinearly correlated with genomic, exonic, and third codon position A, C, G, and T. Information values between orthologous mouse and human RefSeqs are linearly correlated: mouse = 0.092 + 0.55 human. Mouse genes were consistently placed in genomic regions whose GC content was closer to 50% than was the GC content of the human ortholog. Since the (nonlinear) information versus percent GC curve has a minimum at 50% GC and monotonically increases with increasing distance from 50% GC, this phenomenon directly results in the low slope of 0.55. This appears to be a manifestation of an evolutionary strategy for placement of genes in regions of the genome with a GC content that relates synonymous codon bias and protein folding.
Yin, Z., F. Zhang, et al. (2002). "A Chinese Postman Problem based on DNA computing." J Chem Inf Comput Sci 42(2): 222-4. DNA computing is a novel method for solving a class of intractable computational problems, in which the computing can grow exponentially with the problem size. Up to now, many accomplishments have been achieved to improve its performance and increase its reliability. A Chinese Postman Problem has been solved by means of molecular biology techniques in the paper. A small graph was encoded in molecules of DNA, and the "operations" of the computation were performed with standard protocols and enzymes. This work represents further evidence for the ability of DNA computing to solve NP-complete search problems.
Wongpoowarak, W. and P. Wongpoowarak (2002). "Unified algorithm for real-time detection of drug interaction and drug allergy." Comput Methods Programs Biomed 68(1): 63-72. This algorithm aims at unifying and generalizing the algorithm for detecting all types of documented drug interactions such as drug-drug interactions, drug-disease interactions, drug-patient interactions (drug allergy) from patient profile information and drug-laboratory test interactions in real-time prescribing system. Ideally, the system should conform to the following criteria: (1) data independence; (2) software interconnectability; (3) knowledge expandability; (4) flexibility; and (5) computation resource efficiency. We propose a robust Structured Query Language (SQL) algorithm to detect drug interactions and drug allergy according to such criteria. We believe that this is the first public domain algorithm in SQL that could be easily implemented into most open-system prescribing software which support SQL language. The algorithm comprises two major stages: 'expand' and 'extract'. The former expands all information in the prescription with their synonyms, groups, or components. The latter extracts the documented interactions by inner-joining knowledge-base with two independent copies of the expanded prescription list simultaneously. Simulation study for speed performance indicates that this algorithm is well behaved, for the speed of computation does not grow faster than the growth in prescription size.
Winfree, A. T. (2002). "Chemical waves and fibrillating hearts: discovery by computation." J Biosci 27(5): 465-73.
Vereecken, K. M. and J. F. Van Impe (2002). "Analysis and practical implementation of a model for combined growth and metabolite production of lactic acid bacteria." Int J Food Microbiol 73(2-3): 239-50. Next to the traditional application of lactic acid bacteria (LAB) as starter cultures for food fermentations, the use of LAB as protective cultures against microbial pathogens and spoilage organisms in other food production processes gains more and more interest. The inhibitory effect of LAB is mainly accomplished through formation of antimicrobial metabolites. In this paper, the model of Nicolai et al. [Food Microbiol. 10 (1993) 229.], describing cell growth and production of lactic acid, which is the major end-product of LAB metabolism, is investigated. In contrast to classical predictive models, the transition of the exponential growth phase to the stationary phase is obtained through the increasing concentrations of undissociated lactic acid [LaH] and decreasing pH in the environment. To describe the variation in time of [LaH] and pH, a novel, robust calculation method is introduced. The model of Nicolai et al. in combination with the novel method of [LaH] and pH computation is then further applied to an experimental data set of Lactococcus lactis SL05 grown in a rich medium. An accurate description of the measured values of cell concentration, total lactic acid concentration and pH is obtained.
Varghese, A. and L. M. Boland (2002). "Computing transient gating charge movement of voltage-dependent ion channels." J Comput Neurosci 12(2): 123-37. The opening of voltage-gated sodium, potassium, and calcium ion channels has a steep relationship with voltage. In response to changes in the transmembrane voltage, structural movements of an ion channel that precede channel opening generate a capacitative gating current. The net gating charge displacement due to membrane depolarization is an index of the voltage sensitivity of the ion channel activation process. Understanding the molecular basis of voltage-dependent gating of ion channels requires the measurement and computation of the gating charge, Q. We derive a simple and accurate semianalytic approach to computing the voltage dependence of transient gating charge movement (Q-V relationship) of discrete Markov state models of ion channels using matrix methods. This approach allows rapid computation of Q-V curves for finite and infinite length step depolarizations and is consistent with experimentally measured transient gating charge. This computational approach was applied to Shaker potassium channel gating, including the impact of inactivating particles on potassium channel gating currents.
Tzeremes, G., K. K. Rasmussen, et al. (2002). "Efficient computation of the structural phase behavior of block copolymers." Phys Rev E Stat Nonlin Soft Matter Phys 65(4 Pt 1): 041806. A numerical implementation of self-consistent mean-field theory for the structural phase behavior of block copolymers is proposed. Our scheme does not require a priori assumptions of the underlying mesoscopic symmetries. The method potentially enables us to characterize, with high accuracy, the structural phase diagram of block copolymers with significant architectural complexity. We illustrate the method by applying it to a triblock copolymer system.
Siehs, C., R. Oberbauer, et al. (2002). "Discrete simulation of regulatory homo- and heterodimerization in the apoptosis effector phase." Bioinformatics 18(1): 67-76. MOTIVATION: Quantitative simulation of molecular reaction networks is among the most promising approaches towards an understanding of complex biochemical pathways. Numerous qualitative as well as quantitative data from diverse experimental settings, in particular from genomics and proteomics, have to be contextually linked to convert static data into dynamic functionality. RESULTS: This paper presents the Lattice Molecular Automaton, a Cellular Automaton-based simulation tool, capable of representing complex molecular dynamics at different levels of granularity. A data structure concept represents molecular units, whose dynamics, embedded on a 2D grid, is defined via detailed intermolecular interaction profiles. The data structures hold diverse information as molecular type, potential, as well as kinetic energy states, which allows a precise representation of intracellular reaction networks. The molecular dynamics is performed via local computation of individual molecular states on the lattice, which, in conjunction with discretized space and time, enables excellent scalability of this simulation concept. This paper finally gives Lattice Molecular Automaton simulation results on key elements of apoptosis, the cell death cascade, in particular focusing on the regulatory function of homo- and heterodimerization of members of the Bcl-2 protein family in the apoptosis effector phase. The regulatory proteins Bcl2, Bax, and Bak constitute a diffusion-driven molecular switch with inherent damping of apoptosis induction, thereby controlling the apoptosis reaction cascade under noisy, external apoptosis inducing conditions.
Shapiro, J. A. (2002). "Genome organization and reorganization in evolution: formatting for computation and function." Ann N Y Acad Sci 981: 111-34. This volume deals with the role of epigenetics in life and evolution. The most dynamic forms of functional genome formatting involve DNA interacting with cellular complexes that do not alter sequence information. Such important epigenetic phenomena are the main subjects of other articles in this volume. This article focuses on the long-lived form of genome formatting that lies within the DNA sequence itself. I argue for a computational view of genome function as the long-term information storage organelle of each cell. Structural formatting consists of organizing various signals and coding sequences into computationally ready systems facilitating genome expression and genome transmission. The basic features of genome organization can be understood by examining the E. coli lac operon as a paradigmatic genomic system. Multiple systems are connected through distributed signals and repetitive DNA to form higher-order genome system architectures. Molecular discoveries about mechanisms of DNA restructuring show that cells possess the natural genetic engineering functions necessary for evolutionary change by rearranging genomic components and reorganizing system architectures. The concepts of cellular computation and decision-making, genome system architecture, and natural genetic engineering combine to provide a new way of framing evolutionary theories and understanding genome sequence information.
Rey, S., P. A. Carrupt, et al. (2002). "The Hydrogen-Bond: computational approaches and applications to drug design." Ann Pharm Fr 60(6): 386-96. This mini-review begins with a presentation of the hydrogen-bond in the context of other intermolecular recognition forces of significance in molecular biology and molecular pharmacology. This is followed by a survey of the various computational methods available to calculate the hydrogen-bonding capacity of compounds. Such methods use quantum mechanics, molecular mechanics, or algorithms based on experimental fragmental values. A recent and promising advance in the computation of H-bonding capacity is the development of specific molecular interaction fields (MIFs) known as molecular hydrogen-bonding potentials (MHBPs). Their interest in relating molecular properties to pharmacokinetic behaviour is highlighted with two examples, namely oral drug absorption and blood-brain barrier permeation.
Reifman, J., G. R. Gilbert, et al. (2002). "Military research needs in biomedical informatics." J Am Med Inform Assoc 9(5): 509-19. The 2001 U.S. Army Medical Research and Materiel Command (USAMRMC) Biomedical Informatics Roadmap Meeting was devoted to developing a strategic plan in four focus areas: Hospital and Clinical Informatics, E-Health, Combat Health Informatics, and Bioinformatics and Biomedical Computation. The driving force of this Roadmap Meeting was the recent accelerated pace of change in biomedical informatics in which emerging technologies have the potential to affect significantly the Army research portfolio and investment strategy in these focus areas. The meeting was structured so that the first two days were devoted to presentations from experts in the field, including representatives from the three services, other government agencies, academia, and the private sector, and the morning of the last day was devoted to capturing specific biomedical informatics research needs in the four focus areas. This white paper summarizes the key findings and recommendations and should be a powerful tool for the crafting of future requests for proposals to help align USAMRMC new strategic research investments with new developments and emerging technologies.
Regev, A. and E. Shapiro (2002). "Cells as computation." Nature 419(6905): 343.
Rajasekaran, S., X. Jin, et al. (2002). "The efficient computation of position-specific match scores with the fast fourier transform." J Comput Biol 9(1): 23-33. Historically, in computational biology the fast Fourier transform (FFT) has been used almost exclusively to count the number of exact letter matches between two biosequences. This paper presents an FFT algorithm that can compute the match score of a sequence against a position-specific scoring matrix (PSSM). Our algorithm finds the PSSM score simultaneously over all offsets of the PSSM with the sequence, although like all previous FFT algorithms, it still disallows gaps. Although our algorithm is presented in the context of global matching, it can be adapted to local matching without gaps. As a benchmark, our PSSM-modified FFT algorithm computed pairwise match scores. In timing experiments, our most efficient FFT implementation for pairwise scoring appeared to be 10 to 26 times faster than a traditional FFT implementation, with only a factor of 2 in the acceleration attributable to a previously known compression scheme. Many important algorithms for detecting biosequence similarities, e.g., gapped BLAST or PSIBLAST, have a heuristic screening phase that disallows gaps. This paper demonstrates that FFT algorithms merit reconsideration in these screening applications.
Motrescu, V. C. and U. von Rienen (2002). "Simulation of electromagnetic fields in the human body using Finite Integration Technique (FIT)." Biomed Tech (Berl) 47 Suppl 1 Pt 1: 282-4. In the last years, worldwide scientific communities working in the area of computational biology and medicine have shown an increasing interest to the problem of computation of electromagnetic fields in the human body and safety aspects of human exposure to RF radiation. In this paper we investigate the influence of the frequency of mobile phones upon the electromagnetic energy absorption in tissues and express it as specific absorption rate (SAR) calculated with an algorithm based on Finite Integration Technique for a mobile phone excited with two different frequencies: 900 MHz and 1800 MHz. A three-dimensional human model based on anatomical data, which describes the conductivity distribution in the human body, is used in these simulations.
Mills, A. P., Jr. (2002). "Gene expression profiling diagnosis through DNA molecular computation." Trends Biotechnol 20(4): 137-40. Gene expression profiling is the characterization of cells based on the level of gene activity represented by concentrations of complementary DNA reverse transcribed from messenger RNA. The spectrum of cDNA concentrations, the expression profile, is determined using a DNA microarray. Although this approach is valuable for research, a simpler scheme that would give answers on a shorter time-scale for clinical applications is needed. An Adleman DNA self-assembly computer that would use cDNA as input might be ideal for clinical cell discrimination and a neural network architecture would be appropriate for making the necessary classifications. Preliminary experimental results suggest that expression profiling should be feasible using a DNA neural network that acts directly on cDNA.
Lisker, R., A. Carnevale, et al. (2002). "[Opinions of a group of university students about science and technology]." Rev Invest Clin 54(5): 422-9. OBJECTIVE: To learn the opinions of university students of four different areas on the impact of science and technology on society. SUBJECTS: One Hundred and sixty three close to graduate students of the Universidad Autonoma Metropolitana campus Iztapalapa, distributed as follows: Administration 59, Biology 50, Social Sciences 36 and Engineering 18. METHODS: For the survey we translated into spanish part of a questionnaire employed in several countries to explore ideas on the impact of science and technology on society of several groups. It contained general questions such as. Do you believe that science and technology are equally good or bad to society, or degree of knowledge of several technologies such as computation or in vitro fertilization. It includes also more specific questions, such as would your have problems with the use of genetically modified vegetables? RESULTS AND DISCUSSION: The results suggested that Administration and Social Sciences students had less interest in Science and Technology than the other, and that in general, the knowledge of all students is rather limited including biotechnology, genetic enginering and gene therapy. We compared the results with those obtained previously in a group of Mexican Physicians and Biology students from India, Thailand and Singapor.
Kuruvilla, F. G., P. J. Park, et al. (2002). "Vector algebra in the analysis of genome-wide expression data." Genome Biol 3(3): RESEARCH0011. BACKGROUND: Data from thousands of transcription-profiling experiments in organisms ranging from yeast to humans are now publicly available. How best to analyze these data remains an important challenge. A variety of tools have been used for this purpose, including hierarchical clustering, self-organizing maps and principal components analysis. In particular, concepts from vector algebra have proven useful in the study of genome-wide expression data. RESULTS: Here we present a framework based on vector algebra for the analysis of transcription profiles that is geometrically intuitive and computationally efficient. Concepts in vector algebra such as angles, magnitudes, subspaces, singular value decomposition, bases and projections have natural and powerful interpretations in the analysis of microarray data. Angles in particular offer a rigorous method of defining 'similarity' and are useful in evaluating the claims of a microarray-based study. We present a sample analysis of cells treated with rapamycin, an immunosuppressant whose effects have been extensively studied with microarrays. In addition, the algebraic concept of a basis for a space affords the opportunity to simplify data analysis and uncover a limited number of expression vectors to span the transcriptional range of cell behavior. CONCLUSIONS: This framework represents a compact, powerful and scalable construction for analysis and computation. As the amount of microarray data in the public domain grows, these vector-based methods are relevant in determining statistical significance. These approaches are also well suited to extract biologically meaningful information in the analysis of signaling networks.
Kumar, R. and P. Rockett (2002). "Improved sampling of the pareto-front in multiobjective genetic optimizations by steady-state evolution: a pareto converging genetic algorithm." Evol Comput 10(3): 283-314. Previous work on multiobjective genetic algorithms has been focused on preventing genetic drift and the issue of convergence has been given little attention. In this paper, we present a simple steady-state strategy, Pareto Converging Genetic Algorithm (PCGA), which naturally samples the solution space and ensures population advancement towards the Pareto-front. PCGA eliminates the need for sharing/niching and thus minimizes heuristically chosen parameters and procedures. A systematic approach based on histograms of rank is introduced for assessing convergence to the Pareto-front, which, by definition, is unknown in most real search problems. We argue that there is always a certain inheritance of genetic material belonging to a population, and there is unlikely to be any significant gain beyond some point; a stopping criterion where terminating the computation is suggested. For further encouraging diversity and competition, a nonmigrating island model may optionally be used; this approach is particularly suited to many difficult (real-world) problems, which have a tendency to get stuck at (unknown) local minima. Results on three benchmark problems are presented and compared with those of earlier approaches. PCGA is found to produce diverse sampling of the Pareto-front without niching and with significantly less computational effort.
Kim, S., E. R. Dougherty, et al. (2002). "Strong feature sets from small samples." J Comput Biol 9(1): 127-46. For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often give very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing an enormous savings in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expressions can be used to classify BRCA1 and BRCA2 tumors.
Jiang, T., G. Lin, et al. (2002). "A general edit distance between RNA structures." J Comput Biol 9(2): 371-88. Arc-annotated sequences are useful in representing the structural information of RNA sequences. In general, RNA secondary and tertiary structures can be represented as a set of nested arcs and a set of crossing arcs, respectively. Since RNA functions are largely determined by molecular confirmation and therefore secondary and tertiary structures, the comparison between RNA secondary and tertiary structures has received much attention recently. In this paper, we propose the notion of edit distance to measure the similarity between two RNA secondary and tertiary structures, by incorporating various edit operations performed on both bases and arcs (i.e., base-pairs). Several algorithms are presented to compute the edit distance between two RNA sequences with various arc structures and under various score schemes, either exactly or approximately, with provably good performance. Preliminary experimental tests confirm that our definition of edit distance and the computation model are among the most reasonable ones ever studied in the literature.
Halfon, M. S., Y. Grad, et al. (2002). "Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model." Genome Res 12(7): 1019-28. Gene expression is regulated by transcription factors that interact with cis-regulatory elements. Predicting these elements from sequence data has proven difficult. We describe here a successful computational search for elements that direct expression in a particular temporal-spatial pattern in the Drosophila embryo, based on a single well characterized enhancer model. The fly genome was searched to identify sequence elements containing the same combination of transcription factors as those found in the model. Experimental evaluation of the search results demonstrates that our method can correctly predict regulatory elements and highlights the importance of functional testing as a means of identifying false-positive results. We also show that the search results enable the identification of additional relevant sequence motifs whose functions can be empirically validated. This approach, combined with gene expression and phylogenetic sequence data, allows for genome-wide identification of related regulatory elements, an important step toward understanding the genetic regulatory networks involved in development.
Grigera, J. R. (2002). "Molecular dynamics simulation for ligand-receptor studies. Carbohydrates interactions in aqueous solutions." Curr Pharm Des 8(17): 1579-604. The review deals with the problem of the study of ligand-receptor interactions and the use of Molecular Dynamics (MD) simulation to approach such a problem. After a short review of the fundamentals of MD we describe the medium in which all biology takes place, water. Emphasis is put on the water models appropriate for simulation of macromolecular systems explicitly including the water molecules. We consider the quality of the water model both in terms of simplicity and performance to describe the liquid water properties. Heavy water, although not a biologically viable medium, is considered since many experiments make use of it as a solvent. Sweetness of carbohydrates is considered as an example of the procedure suitable to characterize active sites on the ligands. Consideration is given to the computation of the binding constants through molecular dynamics. The computation of the Free Energy is described and illustrated. The potentiality of MD for studies of ligand-receptor interactions is limited by the computer resources, for even with large computing facilities the need of relatively long simulation times severely restricts the study of large systems. A method is described in which several shells are treated at different levels of approximation, form mechanical response and mean electrical field to quantum mechanics, through stochastic dynamics and atomic classical MD. The review closes with a brief account of the perspectives of the method.
Gerrish, P. (2002). "Computation biology: evolution plays dice." Nature 420(6917): 756-7.
Fogolari, F., A. Brigo, et al. (2002). "The Poisson-Boltzmann equation for biomolecular electrostatics: a tool for structural biology." J Mol Recognit 15(6): 377-92. Electrostatics plays a fundamental role in virtually all processes involving biomolecules in solution. The Poisson-Boltzmann equation constitutes one of the most fundamental approaches to treat electrostatic effects in solution. The theoretical basis of the Poisson-Boltzmann equation is reviewed and a wide range of applications is presented, including the computation of the electrostatic potential at the solvent-accessible molecular surface, the computation of encounter rates between molecules in solution, the computation of the free energy of association and its salt dependence, the study of pKa shifts and the combination with classical molecular mechanics and dynamics. Theoretical results may be used for rationalizing or predicting experimental results, or for suggesting working hypotheses. An ever-increasing body of successful applications proves that the Poisson-Boltzmann equation is a useful tool for structural biology and complementary to other established experimental and theoretical methodologies.
Fogel, G. B., V. W. Porto, et al. (2002). "Discovery of RNA structural elements using evolutionary computation." Nucleic Acids Res 30(23): 5310-7. RNA molecules fold into characteristic secondary and tertiary structures that account for their diverse functional activities. Many of these RNA structures, or certain structural motifs within them, are thought to recur in multiple genes within a single organism or across the same gene in several organisms and provide a common regulatory mechanism. Search algorithms, such as RNAMotif, can be used to mine nucleotide sequence databases for these repeating motifs. RNAMotif allows users to capture essential features of known structures in detailed descriptors and can be used to identify, with high specificity, other similar motifs within the nucleotide database. However, when the descriptor constraints are relaxed to provide more flexibility, or when there is very little a priori information about hypothesized RNA structures, the number of motif 'hits' may become very large. Exhaustive methods to search for similar RNA structures over these large search spaces are likely to be computationally intractable. Here we describe a powerful new algorithm based on evolutionary computation to solve this problem. A series of experiments using ferritin IRE and SRP RNA stem-loop motifs were used to verify the method. We demonstrate that even when searching extremely large search spaces, of the order of 10(23) potential solutions, we could find the correct solution in a fraction of the time it would have taken for exhaustive comparisons.
Egan, W. J. and G. Lauri (2002). "Prediction of intestinal permeability." Adv Drug Deliv Rev 54(3): 273-89. This review focuses on computational methods for the prediction of passive intestinal permeability. Existing computational models are surveyed and assessed in terms of descriptors, model type/complexity, speed of computation, predictive performance, and interpretability. Challenges to the successful computational prediction of intestinal permeability, i.e. data quantity, measurement imprecision, confounding factors such as solubility, metabolism, or active efflux, and the need for robust statistical methods, are also discussed.
Cotter, P. J., D. R. Caffrey, et al. (2002). "Improved database searches for orthologous sequences by conditioning on outgroup sequences." Bioinformatics 18(1): 83-91. MOTIVATION: Searches of biological sequence databases are usually focussed on distinguishing significant from random matches. However, the increasing abundance of related sequences on databases present a second challenge: to distinguish the evolutionarily most closely related sequences (often orthologues) from more distantly related homologues. This is particularly important when searching a database of partial sequences, where short orthologous sequences from a non-conserved region will score much more poorly than non-orthologous (outgroup) sequences from a conserved region. RESULTS: Such inferences are shown to be improved by conditioning the search results on the scores of an outgroup sequence. The log-odds score for each target sequence identified on the database has the log-odds score of the outgroup sequence subtracted from it. A test group of Caenorhabditis elegans kinase sequences and their identified C.elegans outgroups were searched against a test database of human Expressed Sequence Tag (EST) sequences, where the sets of true target sequences were known in advance. The outgroup conditioned method was shown to identify 58% more true positives ahead of the first false positive, compared to the straightforward search without an outgroup. A test dataset of 151 proteins drawn from the C.elegans genome, where the putative 'outgroup' was assigned automatically, similarly found 50% more true positives using outgroup conditioning. Thus, outgroup conditioning provides a means to improve the results of database searching with little increase in the search computation time.
Colleau, J. J. (2002). "An indirect approach to the extensive calculation of relationship coefficients." Genet Sel Evol 34(4): 409-21. A method was described for calculating population statistics on relationship coefficients without using corresponding individual data. It relied on the structure of the inverse of the numerator relationship matrix between individuals under investigation and ancestors. Computation times were observed on simulated populations and were compared to those incurred with a conventional direct approach. The indirect approach turned out to be very efficient for multiplying the relationship matrix corresponding to planned matings (full design) by any vector. Efficiency was generally still good or very good for calculating statistics on these simulated populations. An extreme implementation of the method is the calculation of inbreeding coefficients themselves. Relative performances of the indirect method were good except when many full-sibs during many generations existed in the population.
Cheng, A., D. J. Diller, et al. (2002). "Computation of the physio-chemical properties and data mining of large molecular collections." J Comput Chem 23(1): 172-83. Very large data sets of molecules screened against a broad range of targets have become available due to the advent of combinatorial chemistry. This information has led to the realization that ADME (absorption, distribution, metabolism, and excretion) and toxicity issues are important to consider prior to library synthesis. Furthermore, these large data sets provide a unique and important source of information regarding what types of molecular shapes may interact with specific receptor or target classes. Thus, the requirement for rapid and accurate data mining tools became paramount. To address these issues Pharmacopeia, Inc. formed a computational research group, The Center for Informatics and Drug Discovery (CIDD).* In this review we cover the work done by this group to address both in silico ADME modeling and data mining issues faced by Pharmacopeia because of the availability of a large and diverse collection (over 6 million discrete compounds) of drug-like molecules. In particular, in the data mining arena we discuss rapid docking tools and how we employ them, and we describe a novel data mining tool based on a ID representation of a molecule followed by a molecular sequence alignment step. For the ADME area we discuss the development and application of absorption, blood-brain barrier (BBB) and solubility models. Finally, we summarize the impact the tools and approaches might have on the drug discovery process.
Brauer, M. J., M. T. Holder, et al. (2002). "Genetic algorithms and parallel processing in maximum-likelihood phylogeny inference." Mol Biol Evol 19(10): 1717-26. We investigated the usefulness of a parallel genetic algorithm for phylogenetic inference under the maximum-likelihood (ML) optimality criterion. Parallelization was accomplished by assigning each "individual" in the genetic algorithm "population" to a separate processor so that the number of processors used was equal to the size of the evolving population (plus one additional processor for the control of operations). The genetic algorithm incorporated branch-length and topological mutation, recombination, selection on the ML score, and (in some cases) migration and recombination among subpopulations. We tested this parallel genetic algorithm with large (228 taxa) data sets of both empirically observed DNA sequence data (for angiosperms) as well as simulated DNA sequence data. For both observed and simulated data, search-time improvement was nearly linear with respect to the number of processors, so the parallelization strategy appears to be highly effective at improving computation time for large phylogenetic problems using the genetic algorithm. We also explored various ways of optimizing and tuning the parameters of the genetic algorithm. Under the conditions of our analyses, we did not find the best-known solution using the genetic algorithm approach before terminating each run. We discuss some possible limitations of the current implementation of this genetic algorithm as well as of avenues for its future improvement.
Bar-Ziv, R., T. Tlusty, et al. (2002). "Protein-DNA computation by stochastic assembly cascade." Proc Natl Acad Sci U S A 99(18): 11589-92. The assembly of RecA on single-stranded DNA is measured and interpreted as a stochastic finite-state machine that is able to discriminate fine differences between sequences, a basic computational operation. RecA filaments efficiently scan DNA sequence through a cascade of random nucleation and disassembly events that is mechanistically similar to the dynamic instability of microtubules. This iterative cascade is a multistage kinetic proofreading process that amplifies minute differences, even a single base change. Our measurements suggest that this stochastic Turing-like machine can compute certain integral transforms.
Aude, J. C. and A. Louis (2002). "An incremental algorithm for Z-value computations." Comput Chem 26(5): 403-11. The Z-value (Comput. Chem. 23 (1999) 333) is an extension of the Z-score that is classically used to compare sets of biological sequences. The Z-value has been successfully used to handle complete genome studies as well as analyze large sets of proteins. The Z-value computation is based on a Monte Carlo approach to estimate the statistical significance of a Smith & Waterman alignment score. Comet et al. (Comput. Chem. 23 (1999) 333) have shown that, in contrast to the alignment score, the Z-value largely reduces the bias due to the lengths and compositions of the sequences. They also described an estimator of the deviation of Z-values, that we extend in this paper in order to optimize Z-values computation. The incremental algorithm described here provides two characteristics which are usually incompatible: (i) it improves the accuracy of Z-values calculation; (ii) it reduces the time complexity (this algorithm has been named incremental because it iteratively adds random sequences to the Monte-Carlo process when needed). Results are presented, originating from the all-by-all comparison of the proteins from Saccharomyces cerevisiae and Escherichia coli.
Zien, A., T. Aigner, et al. (2001). "Centralization: a new method for the normalization of gene expression data." Bioinformatics 17 Suppl 1: S323-31. Microarrays measure values that are approximately proportional to the numbers of copies of different mRNA molecules in samples. Due to technical difficulties, the constant of proportionality between the measured intensities and the numbers of mRNA copies per cell is unknown and may vary for different arrays. Usually, the data are normalized (i.e., array-wise multiplied by appropriate factors) in order to compensate for this effect and to enable informative comparisons between different experiments. Centralization is a new two-step method for the computation of such normalization factors that is both biologically better motivated and more robust than standard approaches. First, for each pair of arrays the quotient of the constants of proportionality is estimated. Second, from the resulting matrix of pairwise quotients an optimally consistent scaling of the samples is computed.
Yue, J. C., M. K. Clayton, et al. (2001). "A nonparametric estimator of species overlap." Biometrics 57(3): 743-9. For two communities, species overlap has been defined by Smith, Solow, and Preston (1996, Biometrics 52, 1472-1477) as the probability that a randomly selected species is present in both communities given that it is present in at least one community. Species overlap can thus be used to describe the similarity of two communities. In contrast with the parametric estimator of Smith et al., we propose a nonparametric maximum likelihood estimator (NPMLE). We prove that the NPMLE is consistent and asymptotically normally distributed and show that computation of the NPMLE and its standard error is straightforward. We also compare the NPMLE and the estimator of Smith et al. for a variety of situations.
Xing, E. P., D. M. Wolf, et al. (2001). "Automatic discovery of sub-molecular sequence domains in multi-aligned sequences: a dynamic programming algorithm for multiple alignment segmentation." J Theor Biol 212(2): 129-39. Automatic identification of sub-structures in multi-aligned sequences is of great importance for effective and objective structural/functional domain annotation, phylogenetic treeing and other molecular analyses. We present a segmentation algorithm that optimally partitions a given multi-alignment into a set of potentially biologically significant blocks, or segments. This algorithm applies dynamic programming and progressive optimization to the statistical profile of a multi-alignment in order to optimally demarcate relatively homogenous sub-regions. Using this algorithm, a large multi-alignment of eukaryotic 16S rRNA was analyzed. Three types of sequence patterns were identified automatically and efficiently: shared conserved domain; shared variable motif; and rare signature sequence. Results were consistent with the patterns identified through independent phylogenetic and structural approaches. This algorithm facilitates the automation of sequence-based molecular structural and evolutionary analyses through statistical modeling and high performance computation.
Weitkunat, R., A. Crispin, et al. (2001). "Standardization of non-aggregated data: theory and practice." Comput Methods Programs Biomed 65(3): 207-27. Direct standardization of rates is a classical method in epidemiology by which the relationship between confounders and variables of analysis is balanced between the samples or populations to be compared. While the method is appropriate if single variables of aggregated data sets are to be compared, it is limited with respect to handling many variables of different measurement scale, especially in primary data situations. Therefore the method of direct standardization is adapted to the analysis of non-aggregated data. Computation of not only rates but other statistics of central tendency and dispersion as well as confidence intervals hence becomes feasible. In addition it is possible to re-adjust the primary data weights in order to perform stratified analyses. A procedure to do so, based solely on the available original weights without the need for analyzing the standard population is proposed along with an extensive SAS macro. The utility allows for the rapid and standardized analysis and tabulation of large multi-variable data sets. Both the statistical and technical properties of the macro are discussed and its usage is explained as well as exemplified.
Wang, J. (2001). "A pseudo-likelihood method for estimating effective population size from temporally spaced samples." Genet Res 78(3): 243-57. A pseudo maximum likelihood method is proposed to estimate effective population size (Ne) using temporal changes in allele frequencies at multi-allelic loci. The computation is simplified dramatically by (1) approximating the multi-dimensional joint probabilities of all the data by the product of marginal probabilities (hence the name pseudo-likelihood), (2) exploiting the special properties of transition matrix and (3) using a hidden Markov chain algorithm. Simulations show that the pseudo-likelihood method has a similar performance but needs much less computing time and storage compared with the full likelihood method in the case of 3 alleles per locus. Due to computational developments, I was able to assess the performance of the pseudo-likelihood method against the F-statistic method over a wide range of parameters by extensive simulations. It is shown that the pseudo-likelihood method gives more accurate and precise estimates of Ne than the F-statistic method, and the performance difference is mainly due to the presence of rare alleles in the samples. The pseudo-likelihood method is also flexible and can use three or more temporal samples simultaneously to estimate satisfactorily the NeS of each period, or the growth parameters of the population. The accuracy and precision of both methods depend on the ratio of the product of sample size and the number of generations involved to Ne, and the number of independent alleles used. In an application of the pseudo-likelihood method to a large data set of an olive fly population, more precise estimates of Ne are obtained than those from the F-statistic method.
Volfovsky, N., B. J. Haas, et al. (2001). "A clustering method for repeat analysis in DNA sequences." Genome Biol 2(8): RESEARCH0027. BACKGROUND: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. RESULTS: The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi fasta), that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. CONCLUSIONS: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder), should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
van de Warrenburg, B. P. (2001). "[Autosomal dominant cerebellar ataxias in the Netherlands: a national inventory]." Ned Tijdschr Geneeskd 145(20): 962-7. OBJECTIVE: To provide a comprehensive estimate of the number of Dutch autosomal dominant cerebellar ataxias (ADCA) families and patients and thus estimate the minimal prevalence of ADCA in the Netherlands. Furthermore, to observe the relative frequency of SCA mutations and to study genotype-phenotype correlations. DESIGN: Descriptive study and prevalence computation. METHODS: Genotyping was based on cytosine-adenine-guanine (CAG)-repeat expansion detection in the SCA1, SCA2, SCA3, SCA6 and SCA7 genes. We analysed the results of SCA mutation analysis with respect to the number of genotyped families, gene carriers, and clinically affected individuals per SCA locus, as well as individual repeat length and age of onset. Parent-offspring couples were studied for anticipation. The minimal prevalence was extrapolated, based on the observation that at least 36% of ADCA families cannot be genotyped. RESULTS: Per May 1st 2000, 137 Dutch ADCA families were genotyped (SCA1: 15 families; SCA2: 14; SCA3: 64; SCA6: 28; SCA7: 16) and 382 affected individuals had been identified within these families. The extrapolated minimal prevalence was 2.8 per 100,000. All lengths of expanded trinucleotide repeats identified confirmed earlier results regarding pathogenic sizes. Anticipation with an associated increase of repeat length was observed in SCA2 and SCA3 families. The length of the trinucleotide repeat accounted for 65% of the variance in age of onset in SCA3.
Takashima, S. (2001). "The structure and dipole moment of globular proteins in solution and crystalline states: use of NMR and X-ray databases for the numerical calculation of dipole moment." Biopolymers 58(4): 398-409. The large dipole moment of globular proteins has been well known because of the detailed studies using dielectric relaxation and electro-optical methods. The search for the origin of these dipolemoments, however, must be based on the detailed knowledge on protein structure with atomic resolutions. At present, we have two sources of information on the structure of protein molecules: (1) x-ray databases obtained in crystalline state; (2) NMR databases obtained in solution state. While x-ray databases consist of only one model, NMR databases, because of the fluctuation of the protein folding in solution, consist of a number of models, thus enabling the computation of dipole moment repeated for all these models. The aim of this work, using these databases, is the detailed investigation on the interdependence between the structure and dipole moment of protein molecules. The dipole moment of protein molecules has roughly two components: one dipole moment is due to surface charges and the other, core dipole moment, is due to polar groups such as N--H and C==O bonds. The computation of surface charge dipole moment consists of two steps: (A) calculation of the pK shifts of charged groups for electrostatic interactions and (B) calculation of the dipole moment using the pK corrected for electrostatic shifts. The dipole moments of several proteins were computed using both NMR and x-ray databases. The dipole moments of these two sets of calculations are, with a few exceptions, in good agreement with one another and also with measured dipole moments.
Rognes, T. (2001). "ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches." Nucleic Acids Res 29(7): 1647-52. There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith-Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign was found to be as good as Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
Robin, S. and S. Schbath (2001). "Numerical comparison of several approximations of the word count distribution in random sequences." J Comput Biol 8(4): 349-59. The exact distribution of word counts in random sequences and several approximations have been proposed in the past few years. The exact distribution has no theoretical limit but may require prohibitive computation time. On the other hand, approximate distributions can be rapidly calculated but, in practice, are only accurate under specific conditions. After making a survey of these distributions, we compare them according to both their accuracy and computational cost. Rules are suggested for choosing between Gaussian approximations, compound Poisson approximation, and exact distribution. This work is illustrated with the detection of exceptional words in the phage Lambda genome.
Rannala, B. and G. Bertorelle (2001). "Using linked markers to infer the age of a mutation." Hum Mutat 18(2): 87-100. Advances in sequencing and genotyping technologies over the last decade have enabled geneticists to easily characterize genetic variation at the nucleotide level. Hundreds of genes harboring mutations associated with genetic disease have now been identified by positional cloning. Using variation at closely linked genetic markers, it is possible to predict the times in the past at which particular mutations arose. Such studies suggest that many of the rare mutations underlying human genetic disorders are relatively young. Studies of variation at genetic markers linked to particular mutations can provide insights into human geographic history, and historical patterns of natural selection and disease, that are not available from other sources. We review two approaches for estimating allele age using variation at linked genetic markers. A phylogenetic approach aims to reconstruct the gene tree underlying a sample of chromosomes carrying a particular mutation, obtaining a "direct" estimate of allele age from the age of the root of this tree. A population genetic approach relies on models of demography, mutation, and/or recombination to estimate allele age without explicitly reconstructing the gene tree. Phylogenetic methods are best suited for studies of ancient mutations, while population genetic methods are better suited for studies of recent mutations. Methods that rely on recombination to infer the ages of alleles can be fine-tuned by choosing linked markers at optimal map distances to maximize the information available about allele age. A limitation of methods that rely on recombination is the frequent lack of a fine-scale linkage map. Maximum likelihood and Bayesian methods for estimating allele age that rely on intensive numerical computation are described, as well as "composite" likelihood and moment-based methods that lead to simple estimators. The former provide more accurate estimates (particularly for large samples of chromosomes) and should be employed if computationally practical.
Poughon, L., C. G. Dussap, et al. (2001). "Energy model and metabolic flux analysis for autotrophic nitrifiers." Biotechnol Bioeng 72(4): 416-33. The behavior of pure cultures of nitrifying microorganisms under autotrophic growth operating conditions was investigated and the relations between their energy metabolism and their anabolism analyzed by means of metabolic network computation. The description of the metabolism of the nitrifiers is extended to their energy metabolism by introducing compartmentalization (cytoplasmic and periplasmic sides) and studying coupling between the electron transport chain and the proton gradient generation. The energy model of Nitrosomonas and Nitrobacter was developed based on the oxidoreduction reactions known to be involved. The electron transport chains and the associated proton translocation for these models are described. Several possible hypotheses are analyzed and discussed concerning the thermodynamic consistency of all the oxidoreduction reactions. For Nitrosomonas, the most delicate point is the second step of hydroxylamine oxidation. For Nitrobacter a new energy model is proposed in which NO plays an important role as node in the distribution of electrons from NO(2)(-) oxidation to the membrane electron transport chain. The compartmentalization enables us to consider a proton gradient dissipation flux as the expression of the overall energy loss in metabolic analysis (the so-called maintenance phenomena). The energy model (electron transport chain, proton gradient) is associated with an overall description of the metabolism of Nitrosomonas and Nitrobacter in terms of metabolic flux calculation. This representation demonstrates that a maintenance in nitrifiers expressed as a proton leak is no higher than for other aerobes. The yields calculated from the energy models integrated with the metabolic models of nitrifiers are consistent with the experimental yields in the literature.
O'Neill, M. A. and C. C. Hilgetag (2001). "The portable UNIX programming system (PUPS) and CANTOR: a computational environment for dynamical representation and analysis of complex neurobiological data." Philos Trans R Soc Lond B Biol Sci 356(1412): 1259-76. Many problems in analytical biology, such as the classification of organisms, the modelling of macromolecules, or the structural analysis of metabolic or neural networks, involve complex relational data. Here, we describe a software environment, the portable UNIX programming system (PUPS), which has been developed to allow efficient computational representation and analysis of such data. The system can also be used as a general development tool for database and classification applications. As the complexity of analytical biology problems may lead to computation times of several days or weeks even on powerful computer hardware, the PUPS environment gives support for persistent computations by providing mechanisms for dynamic interaction and homeostatic protection of processes. Biological objects and their interrelations are also represented in a homeostatic way in PUPS. Object relationships are maintained and updated by the objects themselves, thus providing a flexible, scalable and current data representation. Based on the PUPS environment, we have developed an optimization package, CANTOR, which can be applied to a wide range of relational data and which has been employed in different analyses of neuroanatomical connectivity. The CANTOR package makes use of the PUPS system features by modifying candidate arrangements of objects within the system's database. This restructuring is carried out via optimization algorithms that are based on user-defined cost functions, thus providing flexible and powerful tools for the structural analysis of the database content. The use of stochastic optimization also enables the CANTOR system to deal effectively with incomplete and inconsistent data. Prototypical forms of PUPS and CANTOR have been coded and used successfully in the analysis of anatomical and functional mammalian brain connectivity, involving complex and inconsistent experimental data. In addition, PUPS has been used for solving multivariate engineering optimization problems and to implement the digital identification system (DAISY), a system for the automated classification of biological objects. PUPS is implemented in ANSI-C under the POSIX.1 standard and is to a great extent architecture- and operating-system independent. The software is supported by systems libraries that allow multi-threading (the concurrent processing of several database operations), as well as the distribution of the dynamic data objects and library operations over clusters of computers. These attributes make the system easily scalable, and in principle allow the representation and analysis of arbitrarily large sets of relational data. PUPS and CANTOR are freely distributed (http://www.pups.org.uk) as open-source software under the GNU license agreement.
Niederberger, C. (2001). "Neural computation in urology: an orientation." Mol Urol 5(4): 133-9. Neural computation is a field in which mathematical models are derived from algorithms based loosely on the physiological function of the biological neuron. This paper serves as an introduction to those that follow in this issue of Molecular Urology, which describe actual applications of neural computational modeling in the urologic domain. In this introductory paper, the history of computer technology and the foundations of neural computation are discussed. Methods of determining the accuracy of computation models are reviewed, and a statistical method of evaluating the significance of individual input features to the model's output, a process known as "feature extraction," is presented. Resources that provide free and commercial neural computational programs are cited for those readers interested in applying this technology to their own datasets, and a brief description of the author's neural computational programming environment is included. Finally, deployment of computational models via the Internet and various computer platforms is discussed.
Levchenko, A. (2001). "Computational cell biology in the post-genomic era." Mol Biol Rep 28(2): 83-9. Rapid accumulation of biological data from novel high throughput technologies characteristic of genomic and proteomic research as well as advances in more traditional biological disciplines are leading to wider use of detailed and complex computational models of cell behavior. These models address a variety of dynamic intracellular processes ranging from interactions within a gene regulation network to intracellular and intercellular signal transduction. This review focuses on the current trends in computation cell biology, particularly emphasizing the role of experimental validation. The recent successes and future challenges facing computational cell biology are also discussed.
Joslyn, C. (2001). "The semiotics of control and modeling relations in complex systems." Biosystems 60(1-3): 131-48. We provide a conceptual analysis of ideas and principles from the systems theory discourse which underlie Pattee's semantic or semiotic closure, which is itself foundational for a school of theoretical biology derived from systems theory and cybernetics, and is now being related to biological semiotics and explicated in the relational biological school of Rashevsky and Rosen. Atomic control systems and models are described as the canonical forms of semiotic organization, sharing measurement relations, but differing topologically in that control systems are circularly and models linearly related to their environments. Computation in control systems is introduced, motivating hierarchical decomposition, hybrid modeling and control systems, and anticipatory or model-based control. The semiotic relations in complex control systems are described in terms of relational constraints, and rules and laws are distinguished as contingent and necessary functional entailments, respectively. Finally, selection as a meta-level of constraint is introduced as the necessary condition for semantic relations in control systems and models.
He, Y., X. Xu, et al. (2001). "Polylink: to support two-point linkage analysis in autotetraploids." Bioinformatics 17(8): 740-1. SUMMARY: Polylink runs under Microsoft Windows (95 or later). It performs various calculations that are useful for investigating two-point linkage analysis for autopolyploids, based on the random chromosome pairing model. These include calculation of offspring phenotypic probabilities as functions of the recombination fraction, calculation of theoretical standard errors for the maximum likelihood estimator of and numerical computation of maximum likelihood estimates. It also includes simulation facilities. AVAILABILITY: Polylink is free and available from Xiangming Xu via email
Glasspool, D. W., J. Fox, et al. (2001). "Risk assessment in genetics: a semi-quantitative approach." Medinfo 10(Pt 1): 459-63. Public awareness of genetic predisposition to diseases such as breast cancer threatens to put severe strain on genetics services. Computer-based decision support for general practitioners (GPs) has the potential to reduce unnecessary referrals, but issues of communicating about levels of risk and uncertainty must be addressed. An argumentation logic formalism can subsume both traditional probability theory and more qualitative, reason-based approaches to communicating uncertainty, and we propose that qualitative, argument-based presentation will make uncertainty information more accessible and comprehensible to both patient and GP. We describe software that uses an argumentation approach to assess genetic risk during a GP consultation and provide referral advice along with detailed qualitative explanations for its advice. The software was evaluated in real-life GP consultations in which actors played patients concerned about genetic risk, and in use by GPs evaluating simulated cases. Significant improvement in accuracy of assessment and appropriateness of referrals was found. GPs viewed the software and the qualitative reporting approach highly favourably.
Foster, J. A. (2001). "Evolutionary computation." Nat Rev Genet 2(6): 428-36. Evolution does not require DNA, or even living organisms. In computer science, the field known as 'evolutionary computation' uses evolution as an algorithmic tool, implementing random variation, reproduction and selection by altering and moving data within a computer. This harnesses the power of evolution as an alternative to the more traditional ways to design software or hardware. Research into evolutionary computation should be of interest to geneticists, as evolved programs often reveal properties - such as robustness and non-expressed DNA - that are analogous to many biological phenomena.
Fearnhead, P. (2001). "Perfect simulation from population genetic models with selection." Theor Popul Biol 59(4): 263-79. We consider using the ancestral selection graph (ASG) to simulate samples from population genetic models with selection. Currently the use of the ASG to simulate samples is limited. This is because the computational requirement for simulating samples increases exponentially with the selection rate and also due to needing to simulate a sample of size one from the population at equilibrium. For the only case where the distribution of a sample of size one is known, that of parent-independent mutations, more efficient simulation algorithms exist. We will show that by applying the idea of coupling from the past to the ASG, samples can be simulated from a general K-allele model without knowledge of the distribution of a sample of size one. Furthermore, the computation involved in generating such samples appears to be less than that of simulating the ASG until its ultimate ancestor. In particular, in the case of genic selection with parent-independent mutations, the computational requirement increases only quadratically with the selection rate. The algorithm is demonstrated by simulating samples at a microsatellite locus.
Diaz-Sierra, R. and V. Fairen (2001). "Simplified method for the computation of parameters of power-law rate equations from time-series." Math Biosci 171(1): 1-19. Modeling biological processes from time-series data is a resourceful procedure which has received much attention in the literature. For models established in the context of non-linear differential equations, parameter-dependent phenomenological tentative response functions are tested by comparing would-be solutions of those models to the experimental time-series. Those values of the parameters for which a tested solution is a best fit are then retained. It is done with the help of some appropriate optimization algorithm which simplifies the searching procedure within the range of variability of the parameters that are to be estimated. The procedure works well in problems with a small number of adjustable parameters or/and with narrow searching ranges. However, it may start to be problematic for models with a large number of problem parameters inasmuch as convergence to the best fit is not necessarily ensured. In this case, a reduction in size of the parameter estimation problem must be undertaken. We presently address this issue by proposing a systematic procedure that does so in problems in which the system's response to a sufficiently small pulse perturbation of steady-state can be obtained. The response is then assumed to be a solution of the linearized equations, the Jacobian of which can be retrieved by a simple multilinear regression. The calculated n(2) Jacobian entries provide as many relationships among problem parameters, thus cutting substantially the size of the starting problem. After this preliminary treatment is applied, only (kappa-n(2)) of the initial kappa adjustable parameters are left for evaluation by means of a non-linear optimization procedure. The benefits of the present variant are both in economy of computation and in accuracy in determining the parameter values. The performance of the method is established under different circumstances. It is illustrated in the context of power-law rates, although this does not preclude its applicability to more general functional responses.
Cox, J. C. and A. D. Ellington (2001). "DNA computation function." Curr Biol 11(9): R336.
Coulson, A. S., D. W. Glasspool, et al. (2001). "RAGs: A novel approach to computerized genetic risk assessment and decision support from pedigrees." Methods Inf Med 40(4): 315-22. OBJECTIVES: To assist general practitioners in evaluating patients' genetic risk of cancer on the basis of family history data. METHODS: A new computer application, RAGs (Risk Assessment in Genetics), has been developed to help doctors create graphical family trees and assess the genetic risk of breast and colorectal cancer. RAGs possesses two features that distinguish it from similar software: (i) a user-centred design, which takes into account the requirements of the doctor-patient encounter; (ii) effective and accessible risk reporting by employing qualitative evidence for or against increased risk, which is more easily understood than numerical probabilities. The system allows any rule-based genetic risk guideline to be implemented, and may be readily modified to cater for the varying degrees of information required by different specialists. RESULTS: RAGs permits fast, accurate data entry, and results in more appropriate management decisions than those made via other techniques. In addition, RAGs enables both the clinician and the patient to understand how it arrives at its conclusions, since the use of qualitative evidence allows the program to provide explanations for its reasoning. CONCLUSIONS: The RAGs system promises to help practitioners be more effective gatekeepers to genetic services. It may empower doctors both to make an informed choice when deciding to refer patients who are at increased genetic risk of breast or colorectal cancer, and to reassure those who are at low risk.
Clerget-Darpoux, F. (2001). "Extension of the lod score: the mod score." Adv Genet 42: 115-24. In 1955 Morton proposed the lod score method both for testing linkage between loci and for estimating the recombination fraction between them. If a disease is controlled by a gene at one of these loci, the lod score computation requires the prior specification of an underlying model that assigns the probabilities of genotypes from the observed phenotypes. To address the case of linkage studies for diseases with unknown mode of inheritance, we suggested (Clerget-Darpoux et al., 1986) extending the lod score function to a so-called mod score function. In this function, the variables are both the recombination fraction and the disease model parameters. Maximizing the mod score function over all these parameters amounts to maximizing the probability of marker data conditional on the disease status. Under the absence of linkage, the mod score conforms to a chi-square distribution, with extra degrees of freedom in comparison to the lod score function (MacLean et al., 1993). The mod score is asymptotically maximum for the true disease model (Clerget-Darpoux and Bonaiti-Pellie, 1992; Hodge and Elston, 1994). Consequently, the power to detect linkage through mod score will be highest when the space of models where the maximization is performed includes the true model. On the other hand, one must avoid overparametrization of the model space. For example, when the approach is applied to affected sibpairs, only two constrained disease model parameters should be used (Knapp et al., 1994) for the mod score maximization. It is also important to emphasize the existence of a strong correlation between the disease gene location and the disease model. Consequently, there is poor resolution of the location of the susceptibility locus when the disease model at this locus is unknown. Of course, this is true regardless of the statistics used. The mod score may also be applied in a candidate gene strategy to model the potential effect of this gene in the disease. Since, however, it ignores the information provided both by disease segregation and by linkage disequilibrium between the marker alleles and the functional disease alleles, its power of discrimination between genetic models is weak. The MASC method (Clerget-Darpoux et al., 1988) has been designed to address more efficiently the objectives of a candidate gene approach.
Clark, J. S., S. R. Carpenter, et al. (2001). "Ecological forecasts: an emerging imperative." Science 293(5530): 657-60. Planning and decision-making can be improved by access to reliable forecasts of ecosystem state, ecosystem services, and natural capital. Availability of new data sets, together with progress in computation and statistics, will increase our ability to forecast ecosystem change. An agenda that would lead toward a capacity to produce, evaluate, and communicate forecasts of critical ecosystem services requires a process that engages scientists and decision-makers. Interdisciplinary linkages are necessary because of the climate and societal controls on ecosystems, the feedbacks involving social change, and the decision-making relevance of forecasts.
Zien, A., R. Zimmer, et al. (2000). "A simple iterative approach to parameter optimization." J Comput Biol 7(3-4): 483-501. Various bioinformatics problems require optimizing several different properties simultaneously. For example, in the protein threading problem, a scoring function combines the values for different parameters of possible sequence-to-structure alignments into a single score to allow for unambiguous optimization. In this context, an essential question is how each property should be weighted. As the native structures are known for some sequences, a partial ordering on optimal alignments to other structures, e.g., derived from structural comparisons, may be used to adjust the weights. To resolve the arising interdependence of weights and computed solutions, we propose a heuristic approach: iterating the computation of solutions (here, threading alignments) given the weights and the estimation of optimal weights of the scoring function given these solutions via systematic calibration methods. For our application (i.e., threading), this iterative approach results in structurally meaningful weights that significantly improve performance on both the training and the test data sets. In addition, the optimized parameters show significant improvements on the recognition rate for a grossly enlarged comprehensive benchmark, a modified recognition protocol as well as modified alignment types (local instead of global and profiles instead of single sequences). These results show the general validity of the optimized weights for the given threading program and the associated scoring contributions.
Wickware, P. (2000). "Next-generation biologists must straddle computation and biology." Nature 404(6778): 683-4.
Wheeler, R. and R. Hughey (2000). "Optimizing reduced-space sequence analysis." Bioinformatics 16(12): 1082-90. MOTIVATION: Dynamic programming is the core algorithm of sequence comparison, alignment and linear hidden Markov model (HMM) training. For a pair of sequence lengths m and n, the problem can be solved readily in O(mn)time and O(mn)space. The checkpoint algorithm introduced by Grice et al. (CABIOS, 13, 45--53, 1997) runs in O(Lmn)time and O(Lm(L) square root of n)space, where L is a positive integer determined by m, n, and the amount of available workspace. The algorithm is appropriate for many string comparison problems, including all-paths and single-best-path hidden Markov model training, and is readily parallelizable. The checkpoint algorithm has a diagonal version that can solve the single-best-path alignment problem in O(mn)time and O(m + n)space. RESULTS: In this work, we improve performance by analyzing optimal checkpoint placement. The improved row checkpoint algorithm performs up to one half the computation of the original algorithm. The improved diagonal checkpoint algorithm performs up to 35% fewer computational steps than the original. We modified the SAM hidden Markov modeling package to use the improved row checkpoint algorithm. For a fixed sequence length, the new version is up to 33% faster for all-paths and 56% faster for single-best-path HMM training, depending on sequence length and allocated memory. Over a typical set of protein sequence lengths, the improvement is approximately 10%.
Sierra, A. and F. Corbacho (2000). "Reclassification as supervised clustering." Neural Comput 12(11): 2537-46. In some branches of science, such as molecular biology, classes may be defined but not completely trusted. Sometimes posterior analysis proves them to be partially incorrect. Despite its relevance, this phenomenon has not received much attention within the neural computation community. We define reclassification as the task of redefining some given classes by maximum likelihood learning in a model that contains both supervised and unsupervised information. This approach leads to supervised clustering with an additional complexity penalizing term on the number of new classes. As a proof of concept, a simple reclassification algorithm is designed and applied to a data set of gene sequences. To test the performance of the algorithm, two of the original classes are merged. The algorithm is capable of unraveling the original three-class hidden structure, in contrast to the unsupervised version (K-means); moreover, it predicts the subdivision of one of the original classes into two different ones.
Sakamoto, K., H. Gouzu, et al. (2000). "Molecular computation by DNA hairpin formation." Science 288(5469): 1223-6. Hairpin formation by single-stranded DNA molecules was exploited in a DNA-based computation in order to explore the feasibility of autonomous molecular computing. An instance of the satisfiability problem, a famous hard combinatorial problem, was solved by using molecular biology techniques. The satisfiability of a given Boolean formula was examined autonomously, on the basis of hairpin formation by the molecules that represent the formula. This computation algorithm can test several clauses in the given formula simultaneously, which could reduce the number of laboratory steps required for computation.
Reinert, K., J. Stoye, et al. (2000). "An iterative method for faster sum-of-pairs multiple sequence alignment." Bioinformatics 16(9): 808-14. MOTIVATION: Multiple sequence alignment is an important tool in computational biology. In order to solve the task of computing multiple alignments in affordable time, the most commonly used multiple alignment methods have to use heuristics. Nevertheless, the computation of optimal multiple alignments is important in its own right, and it provides a means of evaluating heuristic approaches or serves as a subprocedure of heuristic alignment methods. RESULTS: We present an algorithm that uses the divide-and-conquer alignment approach together with recent results on search space reduction to speed up the computation of multiple sequence alignments. The method is adaptive in that depending on the time one wants to spend on the alignment, a better, up to optimal alignment can be obtained. To speed up the computation in the optimal alignment step, we apply the alpha(*) algorithm which leads to a procedure provably more efficient than previous exact algorithms. We also describe our implementation of the algorithm and present results showing the effectiveness and limitations of the procedure.
Pennisi, E. (2000). "Genomics. Wellcome Trust backs genome computation." Science 289(5479): 525.
Pena-Reyes, C. A. and M. Sipper (2000). "Evolutionary computation in medicine: an overview." Artif Intell Med 19(1): 1-23. The term evolutionary computation encompasses a host of methodologies inspired by natural evolution that are used to solve hard problems. This paper provides an overview of evolutionary computation as applied to problems in the medical domains. We begin by outlining the basic workings of six types of evolutionary algorithms: genetic algorithms, genetic programming, evolution strategies, evolutionary programming, classifier systems, and hybrid systems. We then describe how evolutionary algorithms are applied to solve medical problems, including diagnosis, prognosis, imaging, signal processing, planning, and scheduling. Finally, we provide an extensive bibliography, classified both according to the medical task addressed and according to the evolutionary technique used.
Mao, C., T. H. LaBean, et al. (2000). "Logical computation using algorithmic self-assembly of DNA triple-crossover molecules." Nature 407(6803): 493-6. Recent work has demonstrated the self-assembly of designed periodic two-dimensional arrays composed of DNA tiles, in which the intermolecular contacts are directed by 'sticky' ends. In a mathematical context, aperiodic mosaics may be formed by the self-assembly of 'Wang' tiles, a process that emulates the operation of a Turing machine. Macroscopic self-assembly has been used to perform computations; there is also a logical equivalence between DNA sticky ends and Wang tile edges. This suggests that the self-assembly of DNA-based tiles could be used to perform DNA-based computation. Algorithmic aperiodic self-assembly requires greater fidelity than periodic self-assembly, because correct tiles must compete with partially correct tiles. Here we report a one-dimensional algorithmic self-assembly of DNA triple-crossover molecules that can be used to execute four steps of a logical (cumulative XOR) operation on a string of binary bits.
Lermen, M. and K. Reinert (2000). "The practical use of the A* algorithm for exact multiple sequence alignment." J Comput Biol 7(5): 655-71. Multiple alignment is an important problem in computational biology. It is well known that it can be solved exactly by a dynamic programming algorithm which in turn can be interpreted as a shortest path computation in a directed acyclic graph. The A* algorithm (or goal-directed unidirectional search) is a technique that speeds up the computation of a shortest path by transforming the edge lengths without losing the optimality of the shortest path. We implemented the A* algorithm in a computer program similar to MSA (Gupta et al., 1995) and FMA (Shibuya and Imai, 1997). We incorporated in this program new bounding strategies for both lower and upper bounds and show that the A* algorithm, together with our improvements, can speed up computations considerably. Additionally, we show that the A* algorithm together with a standard bounding technique is superior to the well-known Carrillo-Lipman bounding since it excludes more nodes from consideration.
Kazic, T. (2000). "Semiotes: a semantics for sharing." Bioinformatics 16(12): 1129-44. MOTIVATION: Reliable, automated communication of biological information requires methods to declare the information's semantics. In this paper I describe an approach to semantic declaration intended to permit independent, distributed databases, algorithms, and servers to exchange and process requests for information and computations without requiring coordination or agreement among them on universe of discourse, data model, schema, or implementation. RESULTS: This approach uses Glossa, a formal language defining the semantics of biological ideas, information, and algorithms, to executably define the semantics of complex ideas and computations by constructs of semiotes, terms which axiomatically define very simple notions. A database or algorithm wishing to exchange information or computations maintains a set of mappings between its particular notions and semiotes, and a parser to translate between its indigenous ideas and implementation and the semiotes. Requests from other databases or algorithms are issued as semiotic messages, locally interpreted and processed, and the results returned as semiotes to the requesting entity. Thus, semiotes serve as a shared, abstract layer of definitions which can be computably combined by each database or algorithm according to its own needs and ideas. By combining the explicit declaration of semantics with the computation of the semantics of complex ideas, Glossa and its semiotes permit independent computational entities to lightly federate their capabilities as desired while maintaining their unique perspectives on both scientific and technical questions.
Gerstein, G. L. (2000). "Cross-correlation measures of unresolved multi-neuron recordings." J Neurosci Methods 100(1-2): 41-51. An increasing number of laboratories are studying population properties of the nervous system using data where the spike activity of more than one neuron is recorded on each electrode and where, accidentally or deliberately, these activities are not resolved into single unit spike trains. We have previously examined the consequences for measurement of cross-correlation between two such electrodes in the limited case where all individual distant (between electrode) correlations are the same and all individual clo |