Cathy Stein 4-14-11

Host Team: Kathryn Ek, Loren Pachuta, Andy Whillans


Cathy Stein, Ph.D. is a genetic epidemiologist whose research in host genetic susceptibility to tuberculosis is part of a large observational study in Uganda. She earned her B.S. in biology with minors in math and probability & statistics from John Carroll in 1999, and her Ph.D. in genetic epidemiology from Case Western Reserve University in 2004, and is currently an Assistant Professor in Epidemiology & Biostatistics. She will discuss the computational aspects of her research, including:

  1. Analysis of high-throughput data and specialized IT needs
  2. Database design and management
  3. Software design

Genetic Epidemiology

Genetic epidemiology is the study of how genetic factors determine disease and health patterns in families as well as populations. It deals not only with genetic factors but also with environmental factors and their effect on health and disease in groups of people.
Determining whether genetics is a cause of a disease progresses through four studies, each contributing to the development of a model for verification.

  • Familial aggregation: Is there a genetic component to the disease? If so, what are the relative contributions of genes and environment?
  • Segregation: What is the pattern of inheritance of the disease (dominant or recessive)?
  • Linkage: On which part of the chromosome is the disease gene located?
  • Association: Which allele of which gene is associated with the disease?
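
As an illustration of the association step only, here is a minimal sketch of one common textbook approach: a chi-square test on a 2x2 table of allele counts in cases versus controls. The function name and the counts are made up for the example; this is not necessarily the method used in the Uganda study.

```python
import math

def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls; columns are counts of allele A and allele B.
    Assumes every row and column total is greater than zero.
    """
    table = [[case_a, case_b], [control_a, control_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + control_a, case_b + control_b]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if allele and disease status were unrelated
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: allele A appears 60 times in cases and 40 times in controls
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the allele is more common in one group than chance alone would explain. Real association studies also have to account for relatedness, population structure, and testing many markers at once.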

The ultimate goal of Genetic Epidemiology is genetic disease prevention and treatment.



Cathy is a researcher at Case Western Reserve who studies genes related to risk of disease and their genetic link (if any) through lineage. This work requires four areas of study: Biology, Statistics, Mathematics, and Computation.

This study requires specialized software to handle massive amounts of data. There are about 3 billion base pairs in the human genome, and the software needs to store and analyze them for each person studied. As a result, there can be millions of variants from case to case. Researchers also need to share and combine their data, which adds to the computational need. The IT needs of this department are very different from those of most departments.
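
To give a rough sense of scale, the arithmetic can be sketched as follows. The 2-bits-per-base packing is an assumption made for the example (four possible bases fit in 2 bits); real file formats store additional information and are often much larger, or store only the variants and are smaller.

```python
BASE_PAIRS = 3_000_000_000  # approximate genome size quoted above
BITS_PER_BASE = 2           # assumption: A, C, G, T packed into 2 bits each

def genome_bytes(base_pairs=BASE_PAIRS, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store one genome at the given packing density."""
    return base_pairs * bits_per_base // 8

def cohort_gigabytes(people):
    """Gigabytes (10**9 bytes) needed to store one packed genome per person."""
    return people * genome_bytes() / 1e9

print(genome_bytes())          # bytes for one person
print(cohort_gigabytes(1000))  # gigabytes for a hypothetical 1,000-person study
```

Even under this optimistic packing, one genome takes about 750 MB, so a study with a thousand participants is already in the hundreds of gigabytes before any analysis output is stored.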

Tuberculosis in Uganda

This study requires clinical visits to hospitals in Uganda to collect data and DNA samples. Blood from TB patients is sent to the lab. These large data sets are handled in either C or Python. Cathy Stein currently works from Cleveland but is still highly active in the process in Uganda. There are incentives for Ugandan people to participate in the study; many are given drugs and/or treatment free of charge.

Database Management

Some of the people working on the study focus strictly on database management. Their responsibilities are to monitor incoming data, manage the data, and repair the database if necessary. The programs used for database tasks are Microsoft Access and SAS (Statistical Analysis System). Currently, these are run on Linux systems, specifically Red Hat, because the data set is too large for Windows to handle. Pulling data sets together requires a lot of script writing, a skill Cathy learned in school and finds very important. However, there is a great need for programmers who can create statistical models.
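
The kind of script used to pull data sets together might look something like the sketch below. The column names, subject IDs, and marker name are all hypothetical, and the data is embedded as text so the example is self-contained; the study's actual files and tools (Access, SAS) are not shown here.

```python
import csv
import io

# Hypothetical clinical records: one row per study participant
CLINICAL_CSV = """subject_id,tb_status
U001,case
U002,control
U003,case
"""

# Hypothetical genotype records from the lab
GENOTYPE_CSV = """subject_id,marker,genotype
U001,rs_example,AG
U003,rs_example,GG
U004,rs_example,AA
"""

def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def merge_by_subject(clinical_rows, genotype_rows):
    """Attach clinical fields to each genotype row that has a matching subject.

    Returns the merged rows plus the IDs of genotype rows with no clinical record,
    so that missing data can be followed up rather than silently dropped.
    """
    clinical_by_id = {row["subject_id"]: row for row in clinical_rows}
    merged, unmatched = [], []
    for genotype in genotype_rows:
        clinical = clinical_by_id.get(genotype["subject_id"])
        if clinical is None:
            unmatched.append(genotype["subject_id"])
        else:
            merged.append({**clinical, **genotype})
    return merged, unmatched

merged, unmatched = merge_by_subject(read_rows(CLINICAL_CSV), read_rows(GENOTYPE_CSV))
for row in merged:
    print(row)
print("No clinical record for:", unmatched)
```

Reporting the unmatched IDs separately reflects the monitoring role described above: incoming data that does not line up with existing records is flagged instead of being lost.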

Current Dilemma

It has been a struggle for the research development team Cathy works with to find a specific type of person able to handle all aspects of this job. There needs to be a balance between Computer Scientists and Mathematicians. Computer Scientists work with the code directly, create programs to handle data, and engineer software to store the data. Mathematicians use advanced algorithms to solve statistical discrepancies; however, many have only limited programming backgrounds. There is a gap between the two, and it needs to be filled.


Examples of the type of data Cathy works with are explained further in the links below.

Structure of DNA

Genotype Variations



Statistical Programming

Statistical programming is the combination of Computer Science and Statistics to analyze and maintain large data sets using statistical programming languages such as SAS, ADMB, OxMetrics, Quantum and XLispStat (to name a few).

Some courses offered at JCU could help in understanding this field. Some experience in Biology, Statistics, Math, and Computer Science is necessary.

- General Biology
- Cell Biology
- Molecular Biology

- DNA Protein

- Numerical Analysis
- Statistics (theoretical)
- Linear Algebra


Certain types of software do exist for these sorts of analyses. Such software can break DNA into chunks and line up sequence changes in an orderly fashion, greatly improving what researchers can learn from the data. There is also a free statistics package, R, whose core is written largely in C, Fortran, and R itself. It is often used as the base when a new program is needed. The program can be tested at the command line to make sure it works correctly; programmers can then create a GUI to make it accessible to others.
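
As a toy illustration of breaking DNA into chunks and lining up sequence changes, the sketch below splits a sequence into fixed-size pieces and lists the positions where a sample differs from a reference. The sequences and function names are invented for the example, and the comparison assumes both sequences have the same length; real alignment software also handles insertions and deletions.

```python
def split_into_chunks(sequence, size):
    """Split a DNA string into consecutive chunks of at most `size` bases."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]

def list_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences must be the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]

# Made-up sequences for illustration
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

print(split_into_chunks(reference, 4))
for position, ref_base, sample_base in list_differences(reference, sample):
    print(f"position {position}: {ref_base} -> {sample_base}")
```

A script like this can be run and checked at the command line first, which matches the workflow described above of testing before building a GUI on top.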

Information Technology

There is an overall IT group for Case Western, but it is not located on site. Part of Cathy's research team serves as its own on-site IT department. This group finds servers, helps the team install software, sets up new computers, removes viruses, and maintains the existing systems. It also understands some of the intense computational algorithms the team uses to analyze the genetic data processed daily. The research team has its own servers in addition to the "free" central server core.