My Research

Statistical methods for life course epidemiology:
Life course epidemiology is now an established area of epidemiological research. One example is the developmental origins of health and disease hypothesis, which proposes that many chronic adult diseases may originate in early life. The ideal of life course epidemiology is to bring various domains of epidemiological research into the investigation of the causes of chronic diseases. To achieve this aim, life course epidemiology examines hierarchical levels of evidence, such as the inequality of prosperity across countries (higher level), diets and behaviours (individual level) and genetic polymorphisms (lower level). One key research hypothesis is that there may be critical phases of growth, which are crucial to the development of body systems, and that impaired growth in these periods may have a long-lasting effect on health in later life. However, detecting these critical phases is not straightforward, and advanced statistical methodologies have been adopted in recent years to analyse life course epidemiological data. Each method has its advantages and limitations, and real cohort data sets and computer simulations can be used to compare these methods. In 2010, I published a paper in Epidemiology (with a commentary by Professor Tim Cole and my reply) on using partial least squares regression (PLSR) to analyse life course data. The advantage of PLSR over standard generalised linear modelling is that it can deal with the collinearity problem in life course data caused by the high correlations amongst repeated measures of body size. A further development of the PLSR approach is to transform original body size into relative body size in order to identify the critical phases of growth related to adverse health outcomes across the life course.
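
The collinearity point can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not the analysis from the published paper: the data are simulated, the three "body size" measures are hypothetical z-scores that share a common underlying size, and the PLS fit is a hand-written one-component PLS1 in NumPy rather than any particular package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical body size z-scores at three ages. A shared underlying size
# makes the repeated measures highly correlated (assumed, simulated data).
underlying = rng.normal(size=n)
X = np.column_stack([underlying + 0.1 * rng.normal(size=n) for _ in range(3)])
y = X.sum(axis=1) + rng.normal(size=n)  # simulated later-life outcome

# Centre the data so that no intercept is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Collinearity check: condition number of the correlation matrix.
cond = np.linalg.cond(np.corrcoef(Xc, rowvar=False))

# Ordinary least squares, for comparison.
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

# One-component PLS regression (PLS1).
w = Xc.T @ yc
w /= np.linalg.norm(w)      # weights proportional to cov(x_j, y)
t = Xc @ w                  # scores of the single PLS component
q = (t @ yc) / (t @ t)      # regress the outcome on the component
beta_pls = w * q            # coefficients on the original variables

print("condition number:", round(float(cond), 1))
print("OLS coefficients:", np.round(beta_ols, 2))
print("PLS coefficients:", np.round(beta_pls, 2))
```

In this set-up every measure has the same true coefficient. The OLS estimates tend to scatter widely around it because the covariates are nearly collinear, whereas the one-component PLS estimates stay close to one another.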

Statistical methods for age-period-cohort analysis:
In linear models, the identification problem occurs when the design matrix is not of full rank. This can happen when (1) the number of covariates exceeds the number of observations (e.g. in microarray data, where the number of genes is much larger than the number of patients), or (2) there is perfect collinearity amongst the covariates. In clinical and epidemiological research, the identification problem is usually caused by the latter. One well-known example is age-period-cohort (APC) analysis, in which Age + Cohort = Period. A longstanding controversy in epidemiology and sociology surrounds how to estimate, or separate, the distinct effects of age, time period and cohort on changes in, for example, attitudes, behaviours and health outcomes in the population. Because of the intrinsic mathematical relationship amongst the three variables, there is an identification problem in traditional regression analysis. For example, suppose researchers observe an increasing trend in the incidence of type 1 diabetes in children in a geographic area over the last three decades. They may hypothesize that this trend is due to (1) improved diagnostic skills in identifying young patients early (i.e. a time period effect), or (2) decreased early infections due to improved hygiene and living environment (a cohort effect), or both. However, as the risk of type 1 diabetes also increases with age, age too has to be accounted for in order to separate the effects of period and cohort. Since the three variables are mathematically related, one of them has to be removed from the standard regression model; otherwise, the computation cannot proceed. There have been many attempts to overcome this estimation (identification) problem in APC analysis. One common approach is to impose constraints in the estimation process to overcome the computational problem of insufficient rank in the data matrix.
Although this type of modelling strategy produces simultaneous estimates of the age, period and cohort effects, it has been criticized in the statistical literature because the results are sensitive to the constraint chosen, and there is no empirical way to confirm the validity of the chosen constraint. Traditional regression analysis requires the data matrix of covariates to be of full rank for the computation to proceed, but this is not a requirement for partial least squares (PLS) regression. I have shown that, for identification problems such as the one in APC analysis, an implicit constraint is effectively present in PLS estimation in order to obtain unique solutions, and that this constraint is equivalent to the mathematical relation amongst the perfectly collinear covariates, i.e. the constraint is actually inherent in the data.
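
The rank deficiency and the sensitivity to the chosen constraint can be sketched numerically. The example below uses simulated data with hypothetical coefficients, and the variable names are illustrative. It shows that two different constraints give different coefficients but identical fitted values. It also shows that the minimum-norm least squares solution and the first PLS weight vector both satisfy b_age + b_cohort = b_period. The minimum-norm solution stands in here as a simple proxy for a solution restricted to the row space of the data; it is not a reproduction of the published derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

age = rng.integers(0, 15, size=n).astype(float)      # age in years
period = rng.integers(0, 30, size=n).astype(float)   # years since study start
cohort = period - age                                 # so Age + Cohort = Period

# Simulated outcome with hypothetical effects of all three variables.
y = 0.5 * age + 0.2 * period + 0.1 * cohort + rng.normal(size=n)

X = np.column_stack([age, period, cohort])
Xc = X - X.mean(axis=0)   # centring removes the need for an intercept
yc = y - y.mean()

# Three covariates, but only two independent directions.
rank = np.linalg.matrix_rank(Xc, tol=1e-8)

def fit_without(dropped):
    """Least squares with one coefficient constrained to zero."""
    keep = [j for j in range(3) if j != dropped]
    coef = np.zeros(3)
    coef[keep] = np.linalg.lstsq(Xc[:, keep], yc, rcond=None)[0]
    return coef

b_no_cohort = fit_without(2)   # constraint: cohort effect = 0
b_no_period = fit_without(1)   # constraint: period effect = 0

# Different constraints, different coefficients, same fitted values.
fitted_no_cohort = Xc @ b_no_cohort
fitted_no_period = Xc @ b_no_period

# Minimum-norm least squares solution (small singular values truncated).
b_min_norm = np.linalg.lstsq(Xc, yc, rcond=1e-10)[0]

# First PLS weight vector, proportional to X'y.
w = Xc.T @ yc
w /= np.linalg.norm(w)

print("rank:", rank)
print("cohort dropped:", np.round(b_no_cohort, 3))
print("period dropped:", np.round(b_no_period, 3))
print("minimum norm:  ", np.round(b_min_norm, 3))
print("age + cohort - period (min norm):", b_min_norm[0] + b_min_norm[2] - b_min_norm[1])
print("age + cohort - period (PLS weight):", w[0] + w[2] - w[1])
```

Any vector built from the rows of the data matrix is orthogonal to the direction (1, -1, 1) for (age, period, cohort), which is why the last two printed quantities are zero up to rounding error.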

Network meta-analysis:
Also known as mixed treatment comparison, this is a recently developed methodology for evidence synthesis. I have been involved in teaching and undertaking standard meta-analysis since I first started my job in Leeds. My first project was to help with the meta-analyses of the associations between food, diet and physical activity and gastric and pancreatic cancer. That project was funded by the World Cancer Research Fund, and when the results came out in 2008, they were widely reported in the world media. Since then, I have undertaken several meta-analyses in epidemiology and dentistry. However, traditional meta-analysis only makes pairwise comparisons between an active treatment and a control, or between two active treatments, whereas most disease conditions have more than one treatment option. It is very rare for all available treatment options to be compared in a single study. We therefore need a statistical framework that incorporates the available direct and indirect evidence to provide an overall comparison of all treatments in a single meta-analysis. This statistical framework is called network meta-analysis and has recently been proposed under the Bayesian analysis paradigm. I learned the details of this approach by attending a three-day course on Bayesian network meta-analysis at the University of Leicester, which was delivered by people who have been developing this method. Since then, I have started my own research project with colleagues in the UK and Germany to apply this method. Network meta-analysis is going to play a very important role in comparative effectiveness research and cost-effectiveness analysis, providing guidelines for clinical decision making and patient care, because it can provide a holistic comparison of all available treatments. This is especially relevant to the National Health Insurance scheme in Taiwan, as the cost of health care has been growing faster than the budget.
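
The idea of combining direct and indirect evidence can be sketched for the simplest network of three treatments. The example below is a frequentist illustration with made-up numbers, not the Bayesian framework described above: it forms an adjusted indirect comparison of treatments B and C through a common comparator A, then pools it with a direct B versus C estimate by inverse-variance weighting. The function names and effect sizes (log odds ratios) are hypothetical.

```python
import math

def indirect_comparison(d_ab, se_ab, d_ac, se_ac):
    """Indirect estimate of C versus B through the common comparator A.

    d_ab and d_ac are the effects of B and of C relative to A.
    """
    d_bc = d_ac - d_ab
    se_bc = math.sqrt(se_ab ** 2 + se_ac ** 2)  # variances add
    return d_bc, se_bc

def pool_inverse_variance(estimates):
    """Fixed-effect pooling of (estimate, standard error) pairs."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * d for w, (d, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical log odds ratios from three sets of trials.
d_ab, se_ab = -0.30, 0.10          # B versus A
d_ac, se_ac = -0.50, 0.12          # C versus A
direct_bc = (-0.10, 0.20)          # C versus B, head-to-head trials

indirect_bc = indirect_comparison(d_ab, se_ab, d_ac, se_ac)
mixed_bc = pool_inverse_variance([direct_bc, indirect_bc])

print("indirect C vs B:", indirect_bc)
print("mixed C vs B:   ", mixed_bc)
```

The pooled estimate lies between the direct and indirect estimates and has a smaller standard error than either. This simple pooling assumes that the direct and indirect evidence are consistent with each other.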

Association between oral health and systemic diseases:
The relation between poor oral health and an increased risk of systemic diseases such as diabetes and cardiovascular diseases has been a hotly debated issue within dental and medical research. Although many epidemiological studies have found a weak association, most of these studies were not originally designed to test this hypothesis, and their measurements of oral health were usually very crude. In fact, there is no consensus about what the best measures of oral health are, especially for periodontal health. Measuring periodontal health in routine practice is generally very time-consuming and requires taking dental radiographs, so an alternative method is required. I have used the Glasgow University Alumni cohort to look at tooth loss and mortality patterns; my study was published in Heart in 2007 and was widely reported by Reuters on the internet. Recently, I have been working with colleagues in the School of Dentistry at the National Taiwan University and in the Health Management Centre at the National Taiwan University Hospital, using their health screening data to look at dental diagnosis and the risk of metabolic syndrome in a cohort of more than 23,000 patients. As I was first trained in dentistry and then in epidemiology, I am in a very good position to pursue research in this area.

Mathematical coupling of data in regression and correlation analysis:
I have examined the problem of analysing the relation between change and initial value, where mathematical coupling between x – y and x makes testing the usual null hypothesis of zero correlation inappropriate. I then showed that the use of ratio variables may give rise to ambiguous results in regression analysis. The next challenge will be to extend these ideas to the problem of analysing the relation between percentage change and initial value, where mathematical coupling occurs between (x – y)/x and x, and finally to the generalised form of the relation between x/y and w/z. One example of this problem in epidemiological research is ecological studies, where many variables share the same denominator: population statistics such as mortality rates and disease prevalence use the size of the population as the common denominator, so these statistical indices tend to be correlated. One real example is a study published by Archie Cochrane 30 years ago, in which he found that the infant mortality rate was highly correlated with the number of paediatricians per capita in developed countries. Although this problem has been known for many years, no simple solution has yet been found.
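
Both forms of coupling are easy to reproduce by simulation. The sketch below uses entirely simulated, hypothetical data. In the first part x and y are independent with equal variances, yet the correlation between x – y and x is expected to be 1/sqrt(2), about 0.71, rather than zero. In the second part two independent counts become strongly correlated once both are divided by the same independent population size.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Part 1: change versus initial value.
# x (initial value) and y (follow-up value) are independent by construction.
x = rng.normal(size=n)
y = rng.normal(size=n)
r_change_initial = np.corrcoef(x - y, x)[0, 1]
expected_r = 1 / np.sqrt(2)   # cov(x - y, x) = var(x); var(x - y) = 2 var(x)

# Part 2: ratio variables with a common denominator.
# Two independent counts and an independent population size.
a = rng.uniform(50, 150, size=n)
b = rng.uniform(50, 150, size=n)
p = rng.uniform(1, 10, size=n)
r_counts = np.corrcoef(a, b)[0, 1]          # close to zero
r_rates = np.corrcoef(a / p, b / p)[0, 1]   # induced by the shared denominator

print("corr(x - y, x):", round(float(r_change_initial), 3))
print("expected under independence:", round(float(expected_r), 3))
print("corr(a, b):", round(float(r_counts), 3))
print("corr(a/p, b/p):", round(float(r_rates), 3))
```

In the first part, the appropriate reference value for the correlation is therefore not zero but a quantity that depends on the variances of x and y and the correlation between them.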