Chapter 4
MICROARRAY- METHODOLOGY, STATISTICAL ANALYSIS AND IMAGE ANALYSIS
Table of contents:
1. Introduction
2. Methodology: Stages involved
2.1 Experimental design
2.2 Preparation of mRNA and cDNA
2.3 Hybridization
2.4 Image scanning
2.5 Data analysis
2.6 Biological confirmation
2.7 Deposition into databank
2.8 Analysis of data with related data
3. Advantages
4. Disadvantages
5. Statistical analysis of microarray data
6. Microarray image analysis
1. Introduction: A microarray is a solid support (e.g. a glass slide or nylon membrane) on which DNA of known sequence is deposited in a regular, grid-like array. The DNA may take the form of cDNA or oligonucleotides, although other materials (such as genomic DNA clones) may be deposited as well. Several nanograms of DNA are immobilized on the surface of an array. RNA is extracted from the biological sources of interest (also known as targets), such as cell lines with or without drug treatment, or tissues from wild-type or mutant organisms. The RNA (or mRNA) is often converted to cDNA, labeled with fluorescence or radioactivity, and hybridized to the array. During hybridization, cDNAs derived from RNA molecules in the biological starting material hybridize selectively to their corresponding nucleic acids on the microarray surface. After the microarray has been washed, image analysis and data analysis are performed to quantify the detected signals.
Edited by- Anand M.B, Pavitra J.,Pranav A.S,Kashyap A.C. Published by- Anand M. B
2. Stages involved: An overview of the microarray procedure is given below:
Stage 1: Experimental design
Stage 2: RNA preparation and probe preparation
Stage 3: Comparison of two biological samples
Stage 4: Image analysis
Stage 5: Data analysis
Stage 6: Biological confirmation
Stage 7: Data is deposited in a database
Stage 8: Further analysis
2.1 Experimental design: The experimental design of a microarray experiment can be considered in three parts:
1) Biological sample comparison: The biological samples are selected for comparison, such as cell lines with or without drug treatment. If multiple samples are used, these are called "biological replicates".
2) Extracting RNA, converting and labeling: The RNA is extracted and labeled with radioactivity or fluorescence.
3) Arrangement of array elements on a surface: The array elements are arranged in a randomized order. In some cases, array elements are spotted in duplicate.

2.2 RNA preparation and probe preparation: RNA can be purified from cells or tissues using reagents such as TRIzol. When comparing two samples, it is essential to purify the RNA under similar conditions. The purity and quality of the RNA should also be assessed spectrophotometrically, by measuring the A260/A280 ratio, and by gel electrophoresis. The probe is then generated and labeled with fluorescent or chemical dyes.

2.3 Hybridization: In hybridization, two different kinds of sequence are used:
(i) Target sequences, which consist of 100-2000 cDNA or oligonucleotides
(ii) Probe sequences
Target sequences are already fixed on a glass slide, nylon membrane or silicon chip.
Probe sequences are complementary to the gene that is to be analyzed and are labeled with radioactivity or fluorescence. The probes are added to the targets, left to hybridize overnight, and the unbound probe is washed away the next morning.

2.4 Image analysis: After washing, image analysis is performed to obtain a quantitative description of the extent to which each mRNA in the sample is expressed. For experiments using radioactive probes, image analysis is performed by quantitative phosphor imaging. For fluorescence-based microarrays, the array is excited by a laser and the fluorescence intensities are measured. Data for the Cy5 and Cy3 channels may be obtained sequentially and used to compute gene expression ratios.

Figure 1: Example of microarray experiment using radioactive probes.

2.5 Data analysis: Analysis of microarray data is performed to identify individual genes that have been differentially regulated. It is also used to identify broad patterns of gene expression. Statistical methods are used for this analysis (see section 5).
2.6 Biological confirmation: Microarray experiments can be thought of as "hypothesis-generating" experiments. The differential up- or down-regulation of specific genes can be confirmed using independent assays such as:
Northern blots
reverse transcription polymerase chain reaction (RT-PCR)
in situ hybridization

2.7 Deposition into databases: Most academic researchers agree that microarray data should be deposited in public repositories upon publication for future reference. Such databases can be classified mainly into two types:
(i) Microarray database: stores the complete set of raw and processed data.
(ii) Gene expression database: mainly stores the expression of genes in tissues etc.
There are two main repositories:
(i) Gene Expression Omnibus at NCBI.
(ii) ArrayExpress at the European Bioinformatics Institute (EBI).

2.8 Further analysis: Stored data can be used for further experiments. It is likely that uniform standards will be adopted for all microarray experiments. An ongoing trend in the field of bioinformatics is the unification and cross-referencing of many databases, as has occurred for databases of molecular sequences and for databases of protein domains.
3. Major advantages:
Fast: One can obtain data on the expression levels of over 10,000 genes within a single week.
Comprehensive: The entire yeast genome can be represented on a chip.
Flexible: cDNA or oligonucleotides corresponding to any gene can be represented on a chip.
4. Major disadvantages:
Cost: Many researchers find it prohibitively expensive to perform sufficient replicates and other controls.
Unknown significance of RNA: The final product of gene expression is protein, not RNA.
Uncertain quality control: It is impossible for most investigators to assess the identity of the DNA immobilized on any microarray. Also, there are many artifacts associated with image analysis and data analysis.
5. Statistical analysis
A. Normalization process
B. Scatter plots
C. Inferential statistics
D. Descriptive statistics
A. Normalization Process: Normalization is the process of correcting two or more datasets before their gene expression values are compared. The two dyes differ in efficiency, so the raw results obtained from a microarray are not directly comparable. Normalization is used to correct for this. It serves two purposes:
i. Comparing gene expression values within one particular microarray.
ii. Comparing gene expression values across different microarrays.
The technique uses the ratio of the intensities of the two dyes, Cy5 and Cy3. If a gene is expressed equally in both tissues, the ratio is 1. If a gene is expressed more strongly in one tissue than in the other, the ratio differs from 1.
Before normalizing, the background intensity of each microarray spot has to be determined: even before hybridization, spots have some intensity, i.e. the background intensity.
Global normalization is based on the assumption that the average gene does not change its expression between the two test conditions. In other words, two samples, such as normal and diseased tissue, differ in the expression of only a few genes. The normalization is applied to each dataset as follows: the average intensity value of each dataset is calculated, a correction factor is derived from the two averages, and one of the datasets is multiplied by this correction factor.
An alternative is to normalize all expression values to a housekeeping gene that is represented on the array.
Housekeeping normalization uses genes on the microarray that do not change their expression level between the two test conditions. The average intensity value of each gene is divided by the average intensity value of the housekeeping gene. The accuracy of this technique depends on the selection of the housekeeping genes.

B. Scatter plots: Scatter plots are used to visualize microarray datasets. A scatter plot compares the gene expression values from two samples. By looking at the scatter plot, we can see which genes are up-regulated, which are down-regulated and which are unchanged.

[Figure: scatter plot of Cy5 signal from sample 1 against Cy3 signal from sample 2, with regions labeled "Gene up regulated", "Gene down regulated", "High level expression" and "Low level expression".]

All unchanged (normal) genes lie on the 45° line.
X axis: Cy3 signal from sample 2
Y axis: Cy5 signal from sample 1
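As a minimal sketch of the normalization steps in section A and of how the scatter plot is read, the Python below rescales one channel by a correction factor, divides intensities by a housekeeping reference, and then labels each gene by its Cy5/Cy3 ratio. The function names, the sample intensities and the two-fold threshold are assumptions made for illustration and are not taken from the text; the intensities are assumed to be background-corrected already.

```python
from statistics import mean

def global_normalize(cy5, cy3):
    """Rescale Cy3 so that both channels share the same average intensity."""
    factor = mean(cy5) / mean(cy3)  # correction factor from the two averages
    return factor, [value * factor for value in cy3]

def housekeeping_normalize(values, housekeeping_idx):
    """Divide every intensity by the mean intensity of the housekeeping genes."""
    reference = mean(values[i] for i in housekeeping_idx)
    return [value / reference for value in values]

def classify(cy5, cy3, fold=2.0):
    """Label genes by their position relative to the 45-degree line (ratio 1)."""
    labels = []
    for red, green in zip(cy5, cy3):
        ratio = red / green
        if ratio >= fold:            # well above the line: higher in sample 1
            labels.append("up")
        elif ratio <= 1.0 / fold:    # well below the line: lower in sample 1
            labels.append("down")
        else:                        # close to the line
            labels.append("unchanged")
    return labels

# Hypothetical background-corrected intensities for five genes.
cy5 = [200.0, 400.0, 100.0, 800.0, 500.0]
cy3 = [100.0, 200.0, 50.0, 100.0, 550.0]

factor, cy3_norm = global_normalize(cy5, cy3)
labels = classify(cy5, cy3_norm)
hk_scaled = housekeeping_normalize([10.0, 30.0, 40.0], housekeeping_idx=[0, 1])
print(factor, labels, hk_scaled)
```

Without the global correction, the first three genes would all appear two-fold up-regulated, which is the dye-efficiency artifact that normalization is meant to remove.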
Clicking on a spot displays the information or name of the corresponding gene. Often many spots crowd together near the origin (0, 0) of the line. To spread them out, log values are used. The log ratio (base 2) is also used to describe the fold regulation of genes:

Time (t)   Behaviour of gene                      Raw ratio   Log ratio
0          Base level of expression (1st time)    1.0         0.0
1          No change                              1.0         0.0
2          Two-fold up-regulation                 2.0         1.0
3          Two-fold down-regulation               0.5         -1.0
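The log ratios in the table can be reproduced with a few lines of Python. This is only an illustrative sketch: base 2 is assumed because it matches the tabulated values, and the variable names are hypothetical.

```python
import math

# Raw expression ratios from the table (time points 0 to 3).
raw_ratios = [1.0, 1.0, 2.0, 0.5]

# Base-2 logarithm of each ratio.
log_ratios = [math.log2(ratio) for ratio in raw_ratios]

for t, (raw, log) in enumerate(zip(raw_ratios, log_ratios)):
    print(f"t={t}  raw ratio={raw}  log ratio={log}")
```

On the raw scale, two-fold up-regulation (2.0) and two-fold down-regulation (0.5) sit at unequal distances from 1.0; on the log scale they become +1 and -1, which is symmetric around 0.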
C. Inferential Statistics: Inferential statistics are used to find genes that are differentially expressed between test conditions. First, a null hypothesis (H0) is stated: "There is no difference in signal intensity across the conditions being tested", or equivalently, "the intensity values are the same in all test conditions (expression is the same)". A test statistic is then calculated, which is a value that characterizes the observed gene expression data. The hypothesis is accepted or rejected based on the value of the test statistic. If H0 is accepted, the gene is not differentially regulated. If H0 is rejected, the gene is differentially expressed. Test statistic (the two values may come from control and experiment):

$$t = \frac{x_1 - x_2}{\sigma}$$

x1 = intensity value of the gene in one microarray.
x2 = intensity value of the same gene in the other microarray.
σ = standard deviation (deviation from the mean).
The test statistic t is converted into a probability value p. If p is greater than the chosen significance level, H0 is accepted; if p is smaller, H0 is rejected.

D. Descriptive Statistics: Descriptive statistics are used to derive meaningful biological information from microarray datasets and to visualize microarray data.
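For the test statistic of section C, a minimal sketch in Python is given here. It assumes several replicate intensities per condition and uses the pooled-variance (Student) form of the two-sample t statistic, which is one common way of making the formula above concrete; the replicate values are hypothetical. Instead of computing p directly, |t| is compared with the tabulated two-sided critical value for a 0.05 significance level and 6 degrees of freedom (2.447), which leads to the same accept/reject decision.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (Student's t-test)."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical replicate intensities of one gene (4 arrays per condition).
control = [10.0, 12.0, 11.0, 13.0]
treated = [20.0, 22.0, 21.0, 23.0]

# Two-sided critical value for alpha = 0.05 and 4 + 4 - 2 = 6 degrees of freedom.
T_CRITICAL = 2.447

t_value = t_statistic(treated, control)
reject_h0 = abs(t_value) > T_CRITICAL   # True: gene is differentially expressed

# A gene whose replicates have the same mean in both conditions.
t_same = t_statistic([10.0, 12.0, 11.0, 13.0], [11.0, 12.0, 10.0, 13.0])
keep_h0 = abs(t_same) <= T_CRITICAL     # True: H0 is not rejected

print(round(t_value, 3), reject_h0, t_same, keep_h0)
```

In practice thousands of genes are tested at once, so the significance level is usually adjusted for multiple testing; that step is outside this sketch.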
Each descriptive technique involves the reduction of high-dimensional data in order to reach a conclusion. In each case, we begin with a matrix of genes (typically arranged in rows) and samples (typically arranged in columns). Appropriate global or local normalization is applied to the data. Then a metric is defined to describe the similarity (or, alternatively, the distance) between all the data points. Two common metrics for the distance between gene expression data points are:
1. Euclidean distance
2. Pearson coefficient of correlation

1. Euclidean distance: This is based on the distance between two data points. For the two points (x1, x2, x3) and (y1, y2, y3) it is calculated by:

$$D_{12} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$$

The equation can be generalized to n dimensions:

$$D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
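This distance is used in clustering.

2. Pearson correlation coefficient: For any two sets of numbers X = {x1, x2, x3, ..., xn} and Y = {y1, y2, y3, ..., yn}:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma_x}\right) \left(\frac{y_i - \bar{y}}{\sigma_y}\right)$$

where x̄ and ȳ are the means and σx and σy the standard deviations of X and Y, and -1 ≤ r ≤ 1:
r = 0 -> no linear correlation
r = 1 -> perfect positive correlation (both profiles are similar)
r = -1 -> perfect negative correlation (the profiles are opposite)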
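The two metrics can be compared on a small example. The sketch below implements both formulas directly with the Python standard library; the three expression profiles are hypothetical values chosen so that the results are easy to check by hand.

```python
from math import sqrt
from statistics import mean, stdev

def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient using sample standard deviations."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    n = len(x)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

# Hypothetical expression profiles of three genes across three samples.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # same trend as gene_a, twice the magnitude
gene_c = [3.0, 2.0, 1.0]   # opposite trend to gene_a

d_ab = euclidean(gene_a, gene_b)
r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(d_ab, r_ab, r_ac)
```

Genes a and b are perfectly correlated (r close to 1) even though their Euclidean distance is not zero. The choice of metric therefore decides whether genes are grouped by the shape of their profile or by absolute expression level.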
6. Microarray image analysis:
6.1 Introduction: Microarray hybridization experiments are used to measure the concentrations of many different nucleic acid sequences in biological samples in parallel. One of the most important applications is gene expression analysis, which can give insight into an organism's metabolism and its regulation. In brief, a microarray gene expression experiment works as follows:
1. DNA sequences relevant to the biological question are selected and printed onto glass slides, one 'spot' for each sequence, in a rectangular grid layout.
2. These target sequences are linked to the glass surface.
3. mRNA is extracted from the biological sample in question as well as from a reference sample, and fluorescently labeled DNA is produced from it by reverse transcription using labeled nucleotides.
4. The labeled DNA sequences are brought onto the array surface for hybridization.
5. The reference and probe samples are labeled with different dyes and hybridized to the array at the same time.
6. Separate fluorescence images of the reference and probe dyes are taken using a confocal laser scanning microscope.
7. The ratios of reference and probe dye intensities at each spot are interpreted as changes of gene expression levels between the two samples, after correction for dye-specific labeling efficiencies, fluorescence activities and other factors by so-called normalization techniques.
Before ratios can be computed and passed on to a data analysis pipeline, appropriate image regions must be mapped to the positions of the print grid or of the printed sequences, respectively. This task, called 'gridding', is usually done using semi-automatic programs such as Scanalyze. It can be tiring and error-prone work for experiments comprising many microarrays, and it does not suit the idea of an automated analysis pipeline very well.
Some more or less automatic image processing routines have been proposed, but there is as yet no solution that does not impose restrictions on the array design. Methods that use calibrated environments can be quite valuable when used in single larger experiments or in single microarray facilities that produce their own arrays. As soon as data from more than one source are combined for analysis, automated procedures that work with any array design seem necessary, as homogeneity of the data is important. The matter is closely related to the standardization efforts for microarray experiment data exchange.
6.2 Methodology: Microarray image analysis involves two steps: grid segmentation and spot quantitation.
i. Grid Segmentation: The first goal is to find the print grids in a microarray image, i.e. the row and column distances, axes and origins to recover the positions of printed spots. The problems that have to be tackled are caused by the deficiencies of the printing and scanning devices as well as properties of the hybridization assay:
Noise / Contaminations of the slide surface
Rotation of grid axes relative to image
Print grids are not aligned with each other (hence no global periodicity)
Image channels may have relative shift of some pixels
Spot shape and size varies
Signal intensity is highly dynamic
The figure describes the structure of the proposed image processing system. A fast region-based approach (i.e. an approach that reduces 'continuous' gray-level images to connected sets of bright pixels) is used for grid segmentation. Regions in an approximately periodic spatial layout are grouped into 'partial grids', which then enable estimation of the grid rotation. To find the boundaries of individual grids, the abundance of partial grids is projected onto the estimated axes. Many segmentation hypotheses are generated by placing putative grids at rising and falling edges of the projection diagram. A subset of the hypotheses that represents an optimal global grid segmentation of the image is chosen using a dynamic programming algorithm. The optimal segmentation must not contain overlapping grids and must cover as many spot regions as possible. Finally, a heuristic assignment procedure is employed to assign regions to the nodes of the grids. The minimum input data, apart from the scanned images, are the numbers of rows and columns of each grid, but more side information about the placement of grids can also be used to impose constraints on the allowed segmentation results, yielding more robust processing.
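The idea of reducing the image to bright pixels and then looking for rising and falling edges in a projection can be illustrated with a deliberately simplified Python sketch. It is not the system described above: it assumes an unrotated image, projects bright pixels directly (rather than partial grids) onto the x axis, and omits the dynamic programming step. The tiny image, the threshold and all function names are hypothetical.

```python
def bright_mask(image, threshold):
    """Reduce a gray-level image to a binary mask of bright pixels."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

def column_projection(mask):
    """Count the bright pixels in each column (projection onto the x axis)."""
    return [sum(column) for column in zip(*mask)]

def find_runs(profile):
    """Return (start, end) index pairs between rising and falling edges."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # rising edge
        elif value == 0 and start is not None:
            runs.append((start, i - 1))    # falling edge
            start = None
    if start is not None:                  # run reaches the image border
        runs.append((start, len(profile) - 1))
    return runs

# Hypothetical 4 x 8 gray-level image containing two columns of spots.
image = [
    [0, 9, 9, 0, 0, 8, 8, 0],
    [0, 9, 9, 0, 0, 8, 8, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 7, 7, 0, 0, 9, 9, 0],
]

mask = bright_mask(image, threshold=5)
profile = column_projection(mask)
runs = find_runs(profile)
print(profile, runs)
```

Applying the same projection to the rows would give the vertical extents of the spots. Real images additionally need the rotation estimate and noise handling listed in the problems above.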
ii. Spot Quantitation: The quantity of interest for each spot is the ratio of the amounts of label dye in the two channels (named red and green here). Only unnormalised ratios are computed at this stage. They still have to be corrected for the different labeling efficiencies, fluorescence activities etc. of the dyes, but this is done later, since knowledge of the experiment design is necessary for normalisation. Properly chosen hybridisation parameters ensure a simple proportionality between the printed sequence density and the amount of hybridized labelled cDNA at any spatial position within a spot. It follows from these model assumptions that pixels with low signal may be skipped in the ratio computation, as long as the same image region is used in both channels. Possible shifts between the channels are corrected using the region centres from the grid segmentation. Mann-Whitney segmentation is used to identify and discard low-signal pixels for spot quantitation. It uses a statistical test to find sets of signal pixels whose intensities differ significantly from the local background. The method does not require strong assumptions about the spot shape, and the grid segmentation supplies the necessary approximate spot positions and sizes.

Conclusion: The main problem with microarrays is variability. If the test and reference samples are divided and allowed to hybridize on two different slides, the correlation between the log ratios is in the range of 60 to 80%. This problem is often addressed by using series of repeated experiments and repeatedly spotted clones.
However, in many cases the amount of test (and reference) mRNA is limited and repeated experiments are not possible. Therefore it is essential to optimize every step, from the manufacturing of slides to the final analysis of the scanned data. Different noise sources should be identified and, if possible, reduced.
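Returning to the spot quantitation step of section 6.2, the ratio computation over a shared pixel region can be sketched in Python as follows. This is a simplification under stated assumptions: the Mann-Whitney test is replaced by a plain threshold at k times the median local background, and the function name, the parameter k and all pixel values are hypothetical.

```python
from statistics import median

def spot_ratio(red, green, bg_red, bg_green, k=2.0):
    """Unnormalised red/green ratio of one spot.

    Pixels whose signal is low in both channels are skipped, and the
    same set of pixels is then used for the red and the green channel.
    """
    red_bg, green_bg = median(bg_red), median(bg_green)
    keep = [
        i for i in range(len(red))
        if red[i] > k * red_bg or green[i] > k * green_bg
    ]
    if not keep:
        return None  # no usable signal in this spot
    red_signal = sum(red[i] - red_bg for i in keep)
    green_signal = sum(green[i] - green_bg for i in keep)
    return red_signal / green_signal

# Hypothetical pixel intensities of one spot; index i is the same pixel
# in both channels, i.e. the channel shift has already been corrected.
red = [110.0, 210.0, 15.0, 410.0]
green = [60.0, 110.0, 12.0, 210.0]
background_red = [10.0, 10.0, 10.0]
background_green = [10.0, 10.0, 10.0]

ratio = spot_ratio(red, green, background_red, background_green)
empty = spot_ratio([11.0], [12.0], background_red, background_green)
print(ratio, empty)
```

The third pixel is discarded because it is close to background in both channels; the remaining three pixels give a background-corrected ratio that would still have to be normalised as described in section 5.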
7. References:
a) Headings 1 to 4 (Microarray methodology - Anand M.B, Pavitra Jeyakumar):
1. Bibliography:
1> Pevsner J. Bioinformatics and Functional Genomics, 2003 edition.
2. Webliography:
1> http://pevsnerlab.kennedykrieger.org/ppts/2006-09-18_lect05_ch6.pdf
2> http://plasticdog.cheme.columbia.edu/undergraduate_research/projects/sahil_mehta_project/introduction.htm
b) Heading 5 (Statistical analysis of microarray data - Pranav A.S):
1. Bibliography:
1> Pevsner J. Bioinformatics and Functional Genomics, 2003 edition.
2. Webliography:
1> http://pevsnerlab.kennedykrieger.org/ppts/2006-09-18_lect05_ch6.pdf
c) Heading 6 (Microarray image analysis - Kashyap Chhatbar):
1. Bibliography:
1> B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, and J. Watson. Molecular Biology of the Cell.
2. Webliography:
1> www.techfak.uni-bielefeld.de/GK635/publikationen/download/Katzer2002-RAM.pdf
2> www.maths.lth.se/bioinformatics/calendar/20031209/ABengtssonMicroarrayImageAnalysis-20031209.pdf