From omics data to testable biological hypotheses through network statistical analysis.
Johannes Kepler published his three laws of planetary motion in 1609 and 1619. Today they are part of the foundation of astronomy. His intellectual triumph was only made possible by the adoption of Copernicus' revolutionary hypothesis, Tycho Brahe's astronomical data, and the patronage of the Holy Roman Emperors. How was it, though, that Kepler came to recognize these celestial regularities that went unnoticed by the successive geniuses of antiquity? After all, Aristarchus championed heliocentrism, Ptolemy compiled star catalogues and planetary tables, and the elites of society patronized the sciences in much the same way as their Renaissance counterparts.
The answer is that Kepler worked with Brahe's comparatively accurate and comprehensive data tables. It is extremely unlikely, to say the least, that Kepler would ever have conceived of his theory of nonuniform elliptical motion without the benefit of Brahe's work. Brahe was able to improve on Ptolemy's planetary tables because he observed the heavens with superior scientific instruments and employed a newly devised computational technique, with the curious name of prosthaphaeresis, to perform the huge number of multiplications needed to produce much of his astronomical data. Presupposing a scientific outlook in the first place, by my count the factors underlying Kepler's discovery are five in number:
The pattern of events leading up to the discovery of Kepler's laws is a story that has been repeated time and again in science. In the present age, we find ourselves in the midst of a similarly unfolding story with cancer. Cancer biologists working in the late 20th century laid the foundation for a revolution in the understanding of the molecular and cellular bases of the disease. Since that time, advances in sequencing technology have provided the present generation of cancer biologists with the means to produce genome-scale, or omics, datasets that quantify practically all known biomolecules in humans and model organisms. These new scientific instruments provide an unprecedented ability to fill in the details of the emerging mechanistic picture of how cancer functions. And today massively funded international efforts are underway to do just this, with the expectation that some of the findings may eventually prove useful to those who are working on developing new therapies.
What is the role of the bioinformatician in this enterprise?
In a phrase, the role of the bioinformatician is to develop methods and software tools for processing and making sense of genomic data. It is hardly a glamorous role, but in the arena of 21st century cancer research it is essential. To appreciate why, we may return to Kepler. In the first place, Brahe would have found it impossible to produce his datasets without the prosthaphaeresis algorithm. Second, Kepler persevered for the better part of a decade fitting models to the data before he hit on the right one. The bioinformatician in cancer genomics works in the background to develop modern analogs of prosthaphaeresis and help speed up the process of discovery for the Keplers of today in cancer biology.
Where do I come in?
My role in all this is to direct my knowledge of probability and statistics to the development of methods that will accelerate the acquisition of knowledge on how cancer works at the molecular level. More specifically, I am actively involved in the development of computational techniques to extract testable biological hypotheses from heterogeneous and often large omics datasets drawn from genomics, transcriptomics, proteomics, and metabolomics. This is a flourishing area of research in bioinformatics that is rife with conceptual and practical problems pertaining to data analysis, integration, and visualisation. What distinguishes me is my background in the field of complex networks. For me, the network provides a fruitful conceptual framework for making sense of the bewilderingly complex world of cancer, just as heliocentrism did for Kepler. This fusion of statistical estimation with the theory of complex networks is called, at least by me, network statistical analysis. It is with this outlook that I explore ways to go from omics data to testable biological hypotheses that are of practical use to molecular biologists working to understand cancer.
2011—Present
In the time since Seiya Imoto took me on as a postdoc in Satoru Miyano's Laboratory of DNA Information Analysis at The University of Tokyo, the better part of my work has centred on the analysis, integration, and visualisation of omics data. My stay here has been both pleasant and productive, and after some initial efforts to find my footing in genome informatics, my research has really begun to take shape. Before starting into a survey of my past and present projects, however, I feel obliged to express a few passing thoughts on the nature of statistical research in the academic world.
In short, there are two kinds of statistician: first, there is the factory floor statistician who works with scientists on the kinds of concrete inferential problems that routinely crop up in the course of analysing experimental data; second, there is the desk worker statistician who devises statistical techniques based on pet theories with the expectation that their application to previously scrutinized or future data will lead to new discoveries. The garden-variety statistician is usually some amalgam of these caricatures. Finding the right balance between the two is a matter of protracted discussion that I do not want to enter into here. The answer, in any event, will depend on individual taste and circumstance. Instead, I merely wish to call attention to what seem to me to be their more evident merits and demerits.
For the statistician on the factory floor, the invention of a new technique is a matter of strict necessity that proceeds from the bottom up by generalising from novel and often unanticipated statistical problems encountered in the lab. In fact, every major statistical technique that I can think of has been arrived at in this manner. Take the example of principal component analysis. It was invented by Francis Galton while analysing multivariate data on heritable human traits. The reason that Galton came up with it, and not someone else, is that he was the first to work on multivariate data with the new statistically minded outlook that prevailed across England in the late 19th century. Any other competent statistician put in Galton's place would have thought of the same thing. On the other hand, numerous statisticians since Galton's day have analysed multivariate data without ever having added to the toolbox of statistical techniques. This shows that, to thrive, the statistician on the factory floor must be lucky on top of being competent. Should he or she be fortunate enough to be confronted with a novel statistical problem to solve after getting first dibs at analyzing some new kind of data, then he or she is on track to becoming a full professor with a complement of graduate students and postdocs to manage; if not, then he or she is certain to repeat the routine analyses found in the textbooks with minor variations.
Most statisticians are nowhere near as lucky as Galton was when it comes to having privileged access to novel forms of data. The usual recourse of the data have-not statistician is to hunker down at a desk and start cooking up incremental refinements to established statistical techniques or else tailor them for use in particular applications. Research of this kind, although admittedly not very glamorous, is essential for the uninterrupted advancement of science in the modern age. Moreover, this path affords the statistician reasonable hopes of securing a stable academic position, because with a little hard work and determination it is possible to publish a great many papers. In pursuing this path, though, it is easy to get carried away in the production of a large body of overly specialized and technical work that serves no credible scientific purpose. This happens primarily, I think, because the self-generating capacity inherent to research presently operates within an institutional structure in which advancement is too often decided by publication count. As a consequence, researchers end up having to concentrate on short-term, low-risk work that primarily functions to pad publication lists or secure grant money, at the expense of investigating more important long-term or high-risk, high-reward questions. One former colleague of mine, who will remain unnamed, succeeded in publishing 25 papers in the space of a year by carrying on in this manner. Unless one is prepared to risk getting selected out of the system early on, the more ambitious sort of work must usually wait until lifetime employment has been secured.
For what it's worth, I try my best never to stray too far from the factory floor by keeping up a healthy number of collaborations with biologists in the wetlab. At the same time, however, I am mindful that life in the wetlab is a precarious form of existence for the bioinformatician. If I'm lucky, I'll encounter a problem in the course of a data analysis that requires a novel statistical approach to resolve; if not, as is frequently the case, then I'll discover that a satisfactory resolution to the problem already exists. I balance this volatile side of my research by developing new statistical techniques and tools that my experience suggests will be helpful to experimental biologists. At the same time I try to leave a decent amount of time for the investigation of high-risk questions on the fringe of science.
This project is very much in keeping with the desk worker approach to bioinformatics. In cancer research, the correlation between the expression of a gene and patient survival time can provide insights into the mechanisms underlying malignancy that any good molecular biologist can frame as a working hypothesis to test by experiment. A handful of searchable databases, such as PrognoScan, are already available to scour the vast stores of publicly accessible gene expression data with clinical annotation for these sorts of correlations. And I have routinely found the analyses offered by these resources to be invaluable in my experience of working with biologists in the wetlab.
It was in this context that my colleague Atsushi Niida and I hit upon the idea of developing a bioinformatics methodology that generalizes the above scheme from single genes to genesets, that is, groups of functionally related genes. The case studies we examined show that the geneset approach uncovers biologically meaningful correlations that fall outside the purview of the conventional single-gene analysis paradigm. I am presently implementing and evaluating our methodology with a view toward setting up a web-based application and associated database.
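To make the single-gene versus geneset distinction concrete, here is a generic textbook-style sketch. It is not the methodology under development here. It assumes, purely for illustration, that a geneset is summarised per patient as the mean z-score of its member genes, that patients are split at the median score, and that the two groups are compared with a two-group log-rank test. The function names (`geneset_score`, `logrank_test`, `geneset_survival_test`) and the data layout are hypothetical.

```python
import math
from statistics import mean, median, pstdev


def geneset_score(expression, geneset):
    """Per-patient mean z-score over the geneset members found in `expression`.

    `expression` maps a gene symbol to a list of per-patient values
    (assumed layout, for illustration only).
    """
    members = [gene for gene in geneset if gene in expression]
    if not members:
        raise ValueError("no geneset member is present in the expression table")
    z_rows = []
    for gene in members:
        values = expression[gene]
        mu, sd = mean(values), pstdev(values)
        # A gene with constant expression carries no signal; score it as zero.
        z_rows.append([(v - mu) / sd if sd > 0 else 0.0 for v in values])
    return [mean(column) for column in zip(*z_rows)]


def logrank_test(times, events, in_group):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)                           # at risk
        n1 = sum(1 for ti, g in zip(times, in_group) if ti >= t and g)  # at risk, in group
        d = sum(1 for ti, e in zip(times, events) if e and ti == t)     # deaths at t
        d1 = sum(1 for ti, e, g in zip(times, events, in_group)
                 if e and ti == t and g)                                # deaths at t, in group
        observed += d1
        expected += d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group's death count at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank variance is zero; the groups cannot be compared")
    statistic = (observed - expected) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def geneset_survival_test(expression, geneset, times, events):
    """Split patients at the median geneset score and compare survival."""
    scores = geneset_score(expression, geneset)
    cutoff = median(scores)
    high = [score > cutoff for score in scores]
    return logrank_test(times, events, high)


if __name__ == "__main__":
    # Toy data: four patients, two genes, every patient has an observed event.
    toy_expression = {"G1": [1, 2, 3, 4], "G2": [2, 3, 4, 5]}
    print(geneset_survival_test(toy_expression, {"G1", "G2"}, [4, 3, 2, 1], [1, 1, 1, 1]))
```

Passing a one-gene geneset recovers the single-gene analysis, which is the sense in which the geneset scheme generalizes it. A real analysis would also need to handle multiple testing across many genesets, which this sketch ignores.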
At the same time I am working on the visualisation end of things together with our resident data visualisation aficionado Georg Tremmel. We are employing what we call weighted Voronoi treemaps, which, incidentally, constitute a novel theoretical contribution in their own right, to make the output of our statistical analyses intelligible to a general audience. The details at this point are kept purposely vague to hamper the efforts of any unscrupulous readers out there intent on running off with our ideas.
This project, in contrast to the one just above, transpired very much on the factory floor. I worked with Koshi Mimori's team in the Department of Surgery at Kyushu University Beppu Hospital as the primary bioinformatician on an integrative omics data analysis of colorectal cancer tumor samples. The main objective of our work was to discover novel molecular pathways promoting tumorigenesis in colon cancer that could be targeted by future therapies. Through a combination of statistical analysis and experimental work, we proposed AURKA and TPX2 as co-regulators on the MYC pathway.
The discovery that AURKA and TPX2 are bound up with MYC carries immediate therapeutic implications. The trouble is that while the MYC oncogene has long been established as a driver in many cancers, MYC-targeting therapies have yet to be realized. Inhibiting the AURKA/TPX2 axis could, however, prove to be a novel therapeutic approach to MYC-driven cancers, because MYC interacts in a synthetically lethal manner with both AURKA and TPX2. A synthetic lethal therapeutic approach aims to kill MYC-driven tumors by targeting a selected co-regulator on the MYC pathway.
Representative Publications
Alan Turing's onetime assistant I. J. Good published the now classical paper entitled "The population frequencies of species and the estimation of population parameters" in a 1953 volume of Biometrika. It was in this work that he articulated what would come to be known as the unseen species problem. The problem is simply this: How many species are there in a population, including unseen species that do not appear in a given sample? No shortage of solutions has been advanced in the intervening years. I would venture that the best-known among them is due to Bradley Efron and Ronald Thisted, who in a fanciful application of the estimation procedure from their 1976 paper "Estimating the number of unseen species: How many words did Shakespeare know?" found that Shakespeare knew at least 35,000 more words than he let on.
Fun and games aside, there is an analog to this statistical estimation problem in cancer genomics that merits serious consideration. More to follow...
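For readers who want a concrete handle on the unseen species problem, the Good-Turing estimate associated with Good's 1953 paper says that the total probability of all species not yet seen is roughly the fraction of the sample made up of species observed exactly once. The sketch below computes that quantity. Note that it estimates the probability mass of the unseen species, not their number, and that the function name `unseen_mass_estimate` is a hypothetical choice for illustration.

```python
from collections import Counter


def unseen_mass_estimate(sample):
    """Good-Turing estimate of the probability that the next draw is a new species.

    `sample` is any iterable of hashable species labels.
    """
    counts = Counter(sample)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the sample is empty")
    # Number of species that appear exactly once in the sample.
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


if __name__ == "__main__":
    # Toy sample of nine observations spread over five species.
    toy_sample = ["a", "a", "b", "b", "b", "c", "d", "d", "e"]
    print(unseen_mass_estimate(toy_sample))
```

The intuition is that species seen only once are the best available stand-ins for species that have not been seen at all: a sample full of singletons suggests that much of the population is still undiscovered.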
Representative Publications
I've collaborated with a number of wetlab biologists from around Japan, including Koshi Mimori's team in the Department of Surgery at Kyushu University Beppu Hospital, Shigetaka Kitajima's team in the Department of Biomedical Genetics at the Medical Research Institute of Tokyo Medical and Dental University, Kiyoshi Yamaguchi in the Division of Clinical Genome Research at IMS, Yoshinori Murakami's team in the Division of Molecular Pathology at IMS, Minsoo Kim in the Graduate School of Medical Science at Kyoto University, Emmanuel O. Balogun in the Department of Biomedical Chemistry at The University of Tokyo, and my longtime lunch-mate Hideto Koso in the Division of Molecular Developmental Biology at IMS.
Token Publication
2008—Present
Preferential attachment is a process in which a given quantity is distributed among a number of objects in proportion to how much of the said quantity they already have. This process is widely celebrated in the complex network community owing to the fact that it is known to generate those power-law distributions thought to be characteristic of various phenomena in nature, society, and technology.
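In symbols, one common way to write the rule just described is the following. The notation is introduced here for illustration and is not taken from the text above: $x_i$ is the amount that object $i$ currently holds and $N$ is the number of objects.

```latex
\[
  \Pr(\text{the next unit goes to object } i)
  \;=\; \frac{x_i}{\sum_{j=1}^{N} x_j},
  \qquad i = 1, \dots, N .
\]
```

For instance, with holdings of 1, 1 and 2, the third object receives the next unit with probability 2/4 = 1/2, and each of the other two with probability 1/4.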
My interest in preferential attachment stems from my PhD work on gene network estimation. It was in this setting that I first became acquainted with the world of network generation models. The best-known of these is surely the Barabási-Albert model, which couples preferential attachment with growth to generate scale-free networks; or, in other words, networks enjoying power-law degree distributions. This model supplies a simple theoretical explanation accounting for the supposed universality of scale-free networks. And its sudden appearance in the literature over a decade ago sparked a veritable cottage industry of scale-free network model making that reverberates down to the present day. My own contribution to this enterprise is the Poisson-growth model.
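The growth-plus-preferential-attachment mechanism of the Barabási-Albert model can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Poisson-growth model and not code from any publication: the function name `grow_ba_network` is hypothetical, the network starts from `m` isolated seed nodes that the first arrival links to, and each later arrival links to `m` distinct existing nodes drawn with probability proportional to degree.

```python
import random
from collections import Counter


def grow_ba_network(n, m, seed=None):
    """Grow an undirected network of n nodes; each new node brings m edges.

    Returns a list of (new_node, target) edges.
    """
    if m < 1 or n <= m:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    # `stubs` holds one entry per edge endpoint, so a uniform draw from it
    # picks a node with probability proportional to its current degree.
    stubs = []
    targets = list(range(m))  # assumed start: the first arrival links to the m seed nodes
    for new in range(m, n):
        edges.extend((new, target) for target in targets)
        stubs.extend(targets)
        stubs.extend([new] * m)
        # Choose m distinct targets for the next arrival.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(stubs))
        targets = list(chosen)
    return edges


def degree_counts(edges):
    """Map each node to its degree."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees


if __name__ == "__main__":
    degrees = degree_counts(grow_ba_network(n=2000, m=2, seed=1))
    # How many nodes have each degree; a heavy right tail is expected.
    histogram = Counter(degrees.values())
    for k in sorted(histogram)[:10]:
        print(k, histogram[k])
```

The `stubs` list is a design shortcut: rather than recomputing attachment probabilities after every step, it stores each node once per incident edge, which makes a degree-proportional draw a single uniform choice.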
At any rate, what must be borne in mind is that the Barabási-Albert model was advanced as a hypothesis to explain the universality of scale-free networks by appealing to the dual mechanism of growth and preferential attachment. Evidence in support of the preferential attachment hypothesis, as it came to be known, soon followed with the advent of ad hoc methods to check for its presence in real-world growing networks. The main thrust of my work in the field of complex networks concerns the development of statistically rigorous methods for detecting preferential attachment in growing networks. I laid the groundwork for this line of research in my PhD thesis, and I presently collaborate closely with Pham Thong, the graduate student of Hidetoshi Shimodaira who took up where I left off upon my graduation.
Representative Publications
Every network scientist knows that preferential attachment combines with growth to produce networks with power-law in-degree distributions. So how, then, is it possible for the citation network of the American Physical Society journal collection to enjoy a log-normal citation distribution when it was found to have grown in accordance with preferential attachment? This anomalous result, which we dub the preferential attachment paradox, has remained unexplained since the physicist Sidney Redner first brought it to light over a decade ago. In this paper we propose a resolution to the paradox. The source of the mischief, we contend, lies in Redner having relied on a measurement procedure that lacked the accuracy required to distinguish preferential attachment from another form of attachment that is consistent with a log-normal in-degree distribution. There was a high-accuracy measurement procedure in general use at the time, but it could not have been used to shed light on the paradox, owing to a design flaw that induced a systematic error. In recent years, however, the design flaw has been recognised and corrected. Here we show that bringing the newly corrected measurement procedure to bear on the data leads to a resolution of the paradox with important ramifications for the working network scientist.
Representative Publications
2007—2011
In the course of completing my master's degree, I became interested in Hidetoshi Shimodaira's work on the statistical testing of phylogenetic trees estimated from molecular data. When I contacted Shimo about the possibility of pursuing my graduate studies under his supervision, he was enthusiastic about taking me on as his very first PhD student. Then, upon securing adequate funding in the form of a Monbukagakusho Scholarship, it was not long before I was on a plane to Japan to join him at Tokyo Institute of Technology.
As it happened, though, before long we deemed it wise for me to concentrate my efforts on the estimation of gene networks from gene expression data, which was still a trendy area of research in bioinformatics at the time. My primary theoretical innovation in this area was the proposal of a class of informative prior distributions over network structures. For practical purposes, I implemented these so-called scale-free structure priors within a Gaussian graphical modelling framework and estimated gene networks from publicly available gene expression datasets. An example of one such gene network is shown in the figure on the right. This work constitutes one pillar of my PhD thesis; the other is statistical estimation methods for complex networks more generally, as described above.
It was Shimo more than anyone who taught me how to do research. It's from him that I learned how to take an airy hunch, make a clear hypothesis to test on well-chosen examples, revise and generalize the result without getting bogged down in distracting details, and then write it up and send it to the publisher. On top of all that, he went to lengths to attend to my personal, academic, and financial well-being, above and beyond what his station demands. Needless to say, I am deeply appreciative of the support he has given me over the years, and we continue to actively collaborate on topics in complex network research.
Relevant Materials
2004—2006
It's been said that statistics is the refuge of failed mathematicians, but in my case that is only partly true. What happened is that following an abortive attempt to continue on in mathematics, I ultimately decided to pursue a master's degree in probability and statistics at Dalhousie University owing to a fairly elaborate set of considerations. It was not long thereafter that I settled on statistical estimation methods in molecular phylogenetics for my thesis topic, after taking a graduate course on the subject from Ed Susko. The transition from mathematics to statistics was no laughing matter, and I am forever indebted to Ed for his guidance and extraordinary patience along the way. This struggle in transitioning from the mathematical world of static perfection to the statistical world of arbitrary thresholds and rules of thumb is, in retrospect, evident from a casual rereading of my thesis. Truth be told, I could nowadays do a better job in the space of two weeks of what I spent the better part of two years struggling with as a master's student. But that, I guess, is what comes with an additional decade of experience in research. I may even put this boast to the test by getting around to writing up a publication about this work one of these days.
Relevant Materials
2002—2003
After I spent a summer studying a bit about the arithmetic of elliptic curves under the guidance of Keith Johnson, he was nice enough to help me through my honours thesis the following year on hyperelliptic curve cryptography. Keith, or Dr. Johnson as I knew him in those days, was instrumental in shaping my approach to learning mathematics, and I am pleased to see that I am not alone in considering him to be a most excellent teacher; see Rate my Professor. Above all else he instilled in me that the road to grasping any abstract theory, be it mathematical or otherwise, is paved with concrete examples. Learning mathematics from Keith is just one of a great number of fond memories from my undergraduate days in the Math and Stats Department at Dalhousie University. I am especially grateful to Karl Dilcher, who kindly took the time to mentor me from as far back as when I was in high school; Bob Paré, whom I consider to be something of an intellectual father figure; and Georg Gabor, who convinced me that probability theory is properly understood as an extension of deductive logic. Lastly, I would be remiss not to give a shout out to my longtime friend and co-graduate Adam Clay, who went on to become an accomplished mathematician in his own right.
Relevant Materials