Frequently Asked Questions

1) How was this database generated?

2) What is coexpression analysis?

3) What do the correlation coefficients tell me?

4) How do I interpret the correlation coefficients?

5) What is the difference between the Pearson and Spearman correlation coefficients?

6) Why are only 638 genes out of 3714 on the WyeHamster2a microarray available in the database?

7) Are there plans to increase the number of genes over time?

8) Can I download the data that was used to create CGCDB?

9) Why are the microarray probesets annotated to mouse genes?


Question #1: How was this database generated?

This database was constructed from a dataset generated by transcriptional profiling of 295 CHO production cell line samples. Gene expression analysis was carried out on the proprietary CHO-specific WyeHamster2a oligonucleotide microarray, which represents an estimated 10-15% of the CHO transcriptome and was manufactured according to standard Affymetrix protocols (Melville et al., 2011). The methodology and criteria used for total RNA purification, cRNA sample processing and hybridisation to hamster microarrays have been described previously (Doolan et al., 2008). All microarray data were pre-processed as described previously (Clarke et al., 2011).

Parameter                          Type (#samples/range)
Culture vessel type                Bioreactor (172)
                                   Shake flask (123)
CHO cell type                      DUX (144)
                                   K1 (151)
Recombinant protein produced       Monoclonal antibodies (230)
                                   Fusion protein (41)
                                   Growth factors (9)
                                   Non-producing parental cells (12)
Seeding density (10^6 cells/ml)    262 (Min = 0.15 -> Max = 4.33)
Growth phase sample point          Log (125)
                                   Stationary (141)
                                   Death (29)

The dataset was derived from a total of 121 individual cultures (63 from bioreactors and 58 from shake flasks) consisting of either CHO DUX or K1 cell lines grown in serum-free suspension culture and seeded over a range of cell densities. Cell lines assayed include those producing monoclonal antibodies, Fc-fusion proteins, growth factors or coagulation factors, along with non-producing parental cell lines. 11 base media and 8 media feed formulations were present across the experimental dataset. Culture temperature was also varied: it was reduced 24 to 72 hours post inoculation, depending on the specific process being run. Samples were hybridised to microarrays at various stages of the growth cycle, and parallel measurements of culture performance were also determined where applicable. The number of samples assayed from each category utilised in the analysis is summarised in Table I.

Characteristic                   #Samples   Min        Max
Growth rate (h^-1)               225        -0.0058    0.0467
qLactate (mg/10^6 cells/day)     129        -0.1178    1.42
qAmmonia (pM/10^6 cells/day)     145        -0.282     1.81
Viability (%)                    237        28.00      100
Cell density (10^6 cells/ml)     272        0.50       20
Titre (mg/L)                     167        1.8        3005.3
Qp (pg protein/cell/day)         167        0.52       55.39

The data utilised in this database can be freely downloaded from the NCBI Gene Expression Omnibus (GEO).

Question #2: What is coexpression analysis?

Coexpression analysis is a powerful, widely used methodology for the investigation of underlying patterns in gene expression data. This “guilt-by-association” approach aims to locate groups of genes with closely correlated expression levels; if consistent correlations are observed across a phenotypically diverse set of samples, a functional association is suggested. The NICB bioprocessing group has recently carried out a large-scale analysis of the CHO transcriptome using weighted gene coexpression network analysis (WGCNA).

Question #3: What do the correlation coefficients tell me?

The correlation coefficient tells us the strength of the relationship between two genes, or between a gene and a phenotype. The number ranges from -1 to 1; the closer it is to either of these extremes, the stronger the relationship. A negative number indicates an inverse correlation (i.e. as the expression of gene A increases, gene B tends to decrease), while a positive number indicates a positive correlation (i.e. as gene A increases, gene B also tends to increase).

Example of positive correlation between two genes. PCC = 0.92.

Example of negative correlation between two genes. PCC = -0.74.

Question #4: How do I interpret the correlation coefficients?

A correlation coefficient provides a measure of the strength of the relationship between two genes, or between a gene and a phenotype. General rules of thumb can be applied to the absolute value of the coefficient:
0.9 to 1.0: Very strong correlation
0.7 to 0.9: Strong correlation
0.5 to 0.7: Moderate correlation
0.3 to 0.5: Low correlation
0.0 to 0.3: Little if any correlation
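
The rules of thumb above can be expressed as a small lookup. The following is a minimal Python sketch, not part of CGCDB itself: the function name describe_correlation is hypothetical, the bands are applied to the absolute value of the coefficient, and treating each lower bound as inclusive is an assumption, since the ranges as listed share their end points.

```python
def describe_correlation(r: float) -> str:
    """Return a rule-of-thumb label for a correlation coefficient in [-1, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")

    strength = abs(r)  # the sign gives the direction, the magnitude the strength

    # Assumption: each band includes its lower bound (e.g. exactly 0.7 is "strong").
    if strength >= 0.9:
        label = "very strong"
    elif strength >= 0.7:
        label = "strong"
    elif strength >= 0.5:
        label = "moderate"
    elif strength >= 0.3:
        label = "low"
    else:
        return "little if any correlation"

    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"


# The two example coefficients quoted under Question #3.
print(describe_correlation(0.92))   # very strong positive correlation
print(describe_correlation(-0.74))  # strong negative correlation
```

These labels are only a guide; a coefficient should always be read alongside the number of samples it was calculated from.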

Question #5: What is the difference between the Pearson and Spearman correlation coefficients?

The Pearson correlation coefficient

The Pearson correlation coefficient is the most widely used correlation measure in the literature for coexpression analyses. It measures the strength of the linear relationship between two variables.

The Spearman correlation coefficient

The Spearman rank correlation does not assume that the variables have a linear relationship. Calculation begins with ranking the values in each variable, followed by calculation of the Pearson correlation coefficient on the ranked data. The Spearman correlation coefficient is less affected by outliers; however, some of the precision in the data is lost through ranking.
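
To make the relationship between the two measures concrete, here is a minimal Python sketch using only the standard library. It is an illustration rather than the code used to build the database: the function names and the expression values are hypothetical, and tied values are given the mean of their ranks, which is a common convention but an assumption here.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship between x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / spread


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation calculated on the ranked data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values: gene_b always rises with gene_a, but not linearly.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]

print(round(pearson(gene_a, gene_b), 3))  # 0.981
print(spearman(gene_a, gene_b))           # 1.0
```

Because gene_b increases whenever gene_a does, the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient is slightly lower because the relationship is not a straight line.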

Question #6: Why are only 638 genes out of 3714 on the WyeHamster2a microarray available in the database?

Control probesets and probesets that did not reach 100 fluorescence units (generally considered an appropriate threshold on Affymetrix microarrays for presence/absence calling) were removed. In addition, the dataset was further reduced by filtering on a combination of low variance and low connectivity (a measure of the coexpression of a gene with all other genes across the dataset). All remaining probesets had a variance > 0.05 and connectivity > 2.5, resulting in a cohort of 750 probesets (~20% of total) with the greatest variation and connectivity, and thus the highest-confidence coexpression relationships.
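
The variance and connectivity filter can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used to build the database. The exact connectivity formula is not given here, so the sketch assumes a WGCNA-style definition: the sum of the absolute Pearson correlations between a probeset and every other probeset, raised to a power beta. The sketch also assumes the sample variance, assumes the variance filter is applied before connectivity is calculated, and assumes control and low-intensity probesets have already been removed. All function names, probeset names and values are hypothetical.

```python
import math
from statistics import variance  # sample variance (n - 1 denominator); an assumption


def pearson(x, y):
    """Pearson correlation between two equally long lists of expression values."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def connectivity(expression, beta=1.0):
    """Assumed definition: sum of |r| ** beta with every other probeset."""
    names = list(expression)
    totals = {name: 0.0 for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            weight = abs(pearson(expression[first], expression[second])) ** beta
            totals[first] += weight
            totals[second] += weight
    return totals


def filter_probesets(expression, min_variance=0.05, min_connectivity=2.5, beta=1.0):
    """Keep probesets with variance > min_variance and connectivity > min_connectivity."""
    # Drop near-constant probesets first; correlation is undefined for constant data.
    variable = {name: values for name, values in expression.items()
                if variance(values) > min_variance}
    scores = connectivity(variable, beta)
    return [name for name in variable if scores[name] > min_connectivity]


# Hypothetical expression values for six probesets across five samples.
expression = {
    "probe_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "probe_b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "probe_c": [5.0, 4.0, 3.0, 2.0, 1.0],
    "probe_d": [1.1, 2.0, 2.9, 4.2, 5.0],
    "probe_noisy": [2.0, 5.0, 1.0, 4.0, 3.0],     # varies, but tracks nothing else
    "probe_flat": [3.0, 3.01, 3.0, 3.01, 3.0],    # almost no variation
}

kept = filter_probesets(expression)
print(kept)  # ['probe_a', 'probe_b', 'probe_c', 'probe_d']
```

In this toy example probe_flat fails the variance threshold and probe_noisy fails the connectivity threshold, so only the four mutually correlated probesets remain.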

Question #7: Are there plans to increase the number of genes over time?

Yes, we plan to add more genes in the future. We also encourage other groups in the field to submit their microarray data.

Question #8: Can I download the data that was used to create CGCDB?

Yes, the full dataset can be downloaded from the NCBI Gene Expression Omnibus (GEO).

Question #9: Why are the microarray probesets annotated to mouse genes and not the Chinese hamster?

The WyeHamster2a microarray was designed and built before the CHO genome was sequenced. The array can be mapped to human, mouse and rat; at present, only the mouse annotation is used within the database.