Graph - Theoretic Approach To Understand Protein Functioning Using Contact Network

  • Uploaded by: Harsh Purwar
  • June 2020
Indian Institute of Science Education & Research, Kolkata

Semester – IV, April 2009

Graph-theoretic approach to understand protein functioning using contact network

Project Report

Project done by:
  • Harsh Purwar (07MS-76)
  • Debanjan Basu (07MS-71)
  • Sudhanshu Pandey (07MS-80)

BASIC THEORY:

Network: A network consists of components (elements) that interact to facilitate the flow of information, matter or energy, and together form a composite ‘whole’ (a system). A network can be built for any system consisting of a large number of interacting units. Many large networks have been modelled or mapped using the graph-theoretic approach. For example, large biological systems such as metabolic pathways, protein-protein interactions and ecological networks have been mapped (understood) using graphs.

Typically a graph contains:
  • Nodes or elements (components), each represented as a dot, and
  • Links (interactions), each represented by a line.

The following is a simple graph with A, B, C, ... as its nodes; the lines joining them are links showing the interactions between them.

Figure # 1

In this project we have mapped nine proteins and studied various components of their networks. In the case of proteins, the nodes are the alpha carbon atoms of the amino acids, and since each amino acid has only one alpha carbon (Cα), we can also say that in a protein network a node represents an amino acid. The interactions between the various atoms or molecules in a protein are represented by links in the network. These interactions may be covalent or non-covalent. Generally, long-range interactions are non-covalent in nature. These interactions in proteins can be mapped using the distance matrix.

Distance Matrix: It is an n × n matrix which stores the distance between the ith and jth nodes in its (i, j) element. For example, the distance matrix for Figure # 1 above is a 6 × 6 matrix whose first element (1, 1) is 0 and whose next element in the same row, (1, 2), stores the distance between nodes A and B, taking A as node number 1 and B as node number 2. A point that can cause confusion here is that the distance matrix stores the direct (straight-line) distance between every pair of nodes, irrespective of whether they interact. In the case of proteins, the distance matrix stores the distance between each pair of alpha carbons (or amino acids).

Adjacency or Connectivity Matrix: After constructing the distance matrix we scan through the whole matrix and look for various kinds of interactions. Whether two nodes interact depends on the distance between them: if the distance lies between ‘a’ and ‘b’, we say that the two nodes interact, or are connected by a link. In this project, two amino acids whose alpha carbons are less than 7.0 Å apart are taken to interact, and we call these short-range interactions. Similarly, if the distance is between 16.0 Å and 18.0 Å we call them long-range interactions. These long-range and short-range interactions play a vital role in protein folding and function. In the adjacency matrix, if two nodes (say the ith and jth) are connected, that is, if they interact, we put 1 at position (i, j); otherwise we put 0.
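Stated compactly, and writing $d_{ij}$ for the distance between the alpha carbons of residues $i$ and $j$ (the symbols $d_{ij}$ and $A_{ij}$ are notation introduced here for illustration, not the report's own), the short-range rule described above can be sketched as:

```latex
A_{ij} =
\begin{cases}
1, & i \neq j \ \text{and}\ d_{ij} < 7.0\ \text{\AA} \\
0, & \text{otherwise}
\end{cases}
```

The long-range network follows the same pattern with the condition $16.0\ \text{Å} < d_{ij} < 18.0\ \text{Å}$ in place of $d_{ij} < 7.0\ \text{Å}$.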
If we construct an adjacency matrix for Figure # 1 above we get:

        A  B  C  D  E  F
    A   0  1  0  1  0  0
    B   1  0  0  0  1  0
    C   0  0  0  1  1  1
    D   1  0  1  0  0  0
    E   0  1  1  0  0  0
    F   0  0  1  0  0  0
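As a small check on the matrix above, it can be rebuilt from a list of links. The sketch below is written in Python 3 (the report's own listings are Python 2); the names `NODES`, `LINKS` and `adjacency_from_links` are hypothetical, and the link list is read off the matrix itself since the figure is not reproduced here.

```python
# Node labels and links, read off the adjacency matrix given for Figure # 1.
NODES = ["A", "B", "C", "D", "E", "F"]
LINKS = [("A", "B"), ("A", "D"), ("B", "E"),
         ("C", "D"), ("C", "E"), ("C", "F")]


def adjacency_from_links(nodes, links):
    """Return a symmetric 0/1 matrix (list of lists) for an undirected graph."""
    index = {name: k for k, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in links:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # an interaction is mutual, so the matrix is symmetric
    return matrix


FIG1 = adjacency_from_links(NODES, LINKS)
for name, row in zip(NODES, FIG1):
    print(name, *row)
```

Because every link is entered in both directions, the matrix equals its own transpose, which is a quick sanity test for any adjacency matrix of an undirected network.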

Information Source: To study the interactions we need detailed information about the structure of the protein. The main source of information used in this project was the Protein Data Bank file, or .pdb file, which is freely available over the internet (see the references for more details). This file stores almost all the information about a specific protein in text format. The data are obtained by X-ray diffraction and NMR studies of the proteins characterized to date. For this project we need only the small portion of the file that stores the coordinates of the atoms in the protein structure.

Length of a protein: The length of a protein is the number of amino acids it consists of. Some proteins also contain glycolipids, glycerol moieties, etc., which increase their length. In this project we have not considered them.
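To illustrate which portion of a .pdb file is needed, here is a minimal Python 3 sketch (not the report's Program # 1) that reads alpha carbon coordinates by fixed column position, as laid out in the PDB format: atom name in columns 13-16 and x, y, z in columns 31-38, 39-46 and 47-54. The function name `read_ca_coordinates` and the sample records in `SAMPLE_PDB` are made up for the example, and stopping at `ENDMDL` to keep only the first model is an assumption of this sketch.

```python
SAMPLE_PDB = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "ATOM     10  CA  GLY A   2      23.100  26.900   4.000",
    "HETATM  900 CA    CA A 201      10.000  11.000  12.000",  # calcium ion
    "ENDMDL",
    "ATOM      2  CA  MET A   1      99.000  99.000  99.000",  # second model
]


def read_ca_coordinates(pdb_lines):
    """Return [(x, y, z), ...] for the alpha carbons of the first model."""
    coords = []
    for line in pdb_lines:
        record = line[0:6].strip()
        if record == "ENDMDL":
            break  # NMR entries hold several models; keep only the first
        # " CA " (with the leading blank) is an alpha carbon; a calcium ion
        # is written "CA  " and is therefore skipped.
        if record in ("ATOM", "HETATM") and line[12:16] == " CA ":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


ca_coords = read_ca_coordinates(SAMPLE_PDB)
print("Length of the protein:", len(ca_coords))
```

Reading by column rather than splitting on whitespace keeps working when neighbouring fields run together, which can happen in PDB files with large coordinates or four-digit residue numbers. The number of alpha carbons found is the length of the protein as defined above.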


Degree of a node (ki): It is the number of nodes to which the ith node is directly connected.

Clustering coefficient (Ci): The clustering coefficient of a node ‘i’ is the fraction of the maximum possible links among the neighbouring nodes of ‘i’ that are actually present. For a node with ki neighbours the maximum possible number of such links is ki(ki − 1)/2. Summing Ci over all nodes and dividing by the total number of nodes gives the average clustering coefficient of the network.

Path length (Lij): The number of nodes that must be traversed to go from node ‘i’ to node ‘j’ by the shortest path. By shortest path we mean the path through the least number of nodes, not the least spatial distance.

We could not calculate the clustering coefficients and the path lengths because of the limited time.
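The three quantities defined above can be computed directly from an adjacency matrix. The following Python 3 sketch is not part of the project's programs; the function names and the small test graph `TOY` are hypothetical, and the path length is counted in links (hops) along the shortest path, which is an assumption about the intended convention.

```python
from collections import deque

# Toy graph: nodes 0, 1, 2 form a triangle and node 3 hangs off node 2.
TOY = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]


def degree(adj, i):
    """Number of nodes directly connected to node i."""
    return sum(adj[i])


def clustering_coefficient(adj, i):
    """Fraction of possible links among the neighbours of i that exist."""
    nbrs = [j for j, linked in enumerate(adj[i]) if linked]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to link
    links = sum(adj[a][b] for n, a in enumerate(nbrs) for b in nbrs[n + 1:])
    return 2.0 * links / (k * (k - 1))


def path_length(adj, source, target):
    """Links on the shortest path (breadth-first search); None if none."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt, linked in enumerate(adj[node]):
            if linked and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


coeffs = [clustering_coefficient(TOY, i) for i in range(len(TOY))]
average_c = sum(coeffs) / len(TOY)
print(degree(TOY, 2), coeffs, average_c, path_length(TOY, 0, 3))
```

Breadth-first search visits nodes in order of increasing hop count, so the first time the target is reached the hop count is minimal; spatial distances between the nodes play no part.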

THE RIGHT DIRECTION (METHODOLOGY)

  • Download a Protein Data Bank file for the protein of your interest.
  • Extract the coordinates of the alpha carbon atoms from the file.
  • Calculate the distance between every pair of alpha carbon atoms using the Euclidean distance formula,
    $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$,
    and construct the distance matrix.
  • Scan this distance matrix and construct from it the adjacency matrix on the basis of a distance threshold. We assumed it to be 7.0 Å.
  • Calculate various network parameters and plot graphs on the basis of the adjacency matrix to study your protein.
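The distance and thresholding steps above can be sketched in a few lines of Python 3. This is an illustration rather than the report's own program: the function names and the four sample coordinates in `DEMO_COORDS` are made up, and the `low`/`high` parameters are an assumed generalization so that the same function serves both the short-range (below 7.0 Å) and long-range (16.0 Å to 18.0 Å) networks.

```python
import math

# Hypothetical alpha carbon positions (in Å) lying along the x axis.
DEMO_COORDS = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0),
               (7.6, 0.0, 0.0), (17.0, 0.0, 0.0)]


def distance_matrix(coords):
    """Symmetric matrix of Euclidean distances between all pairs of points."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(coords[i], coords[j])))
            dist[i][j] = dist[j][i] = d
    return dist


def adjacency_matrix(dist, low=0.0, high=7.0):
    """Put 1 where low < distance < high (never on the diagonal), else 0."""
    n = len(dist)
    return [[1 if i != j and low < dist[i][j] < high else 0
             for j in range(n)]
            for i in range(n)]


dist = distance_matrix(DEMO_COORDS)
short_range = adjacency_matrix(dist)             # d < 7.0 Å
long_range = adjacency_matrix(dist, 16.0, 18.0)  # 16.0 Å < d < 18.0 Å
for row in short_range:
    print(*row)
```

Only the upper triangle is computed and then mirrored, since the distance from i to j equals the distance from j to i; this halves the work compared with looping over every ordered pair.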

COMPUTER PROGRAMS:

Program # 1: The following program extracts the required information (i.e. the coordinates of the alpha carbon atom of each amino acid) from a ‘pdb’ file and writes it to another file named after the pdb file with the suffix ‘_CA2’. That is, if the pdb file is named ‘1ZQC’, the file with the coordinates of the alpha carbon atoms will be ‘1ZQC_CA2’. This is a text file and can be opened in WordPad (recommended) or Notepad on Windows. Linux or Unix users can open it with any text editor such as gedit, nano or vim. The file is created in the same directory in which the main program file is stored. The program also creates an intermediate file with the suffix ‘_CA’.

The program also draws certain plots and saves them in JPEG format. These plots are:
  • Protein backbone (suffix ‘_1.jpg’)
  • Short-range protein interactions (distance between Cα’s less than 7.0 Å) (suffix ‘_3.jpg’)
  • Protein interactions (distance between Cα’s in the range 9.0 Å – 11.0 Å) (suffix ‘_4.jpg’)
  • Long-range protein interactions (distance in the range 16.0 Å – 18.0 Å) (suffix ‘_5.jpg’)
  • Adjacency matrix for distance between Cα’s less than 7.0 Å (suffix ‘_2.jpg’)

NOTE: The green dots (the ‘lawn’) in the adjacency matrix diagram correspond to pairs of Cα’s whose distance is greater than 7.0 Å, i.e. pairs that are not considered to be interacting with each other (binary equivalent 0), whereas the red portion is where the Cα’s interact or are connected (binary equivalent 1).

    # Python 2 script; the plotting step needs gnuplot and the Gnuplot.py package.
    print 'Executing main.py'
    # raw_input() returns the typed name as a string (Python 2's input() would evaluate it).
    fin=raw_input("Enter the name of the PDB file: ")
    #for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:

    # Step 1: copy the alpha carbon records of the first model into '<fin>_CA'.
    f1=open(fin,'r')
    f2=open(fin+'_CA','w+')
    pos=0
    for i in f1.readlines():
        f1.seek(pos,0)
        s2=f1.readline(100)
        if len(s2)>15:
            if s2[0:5]=='ATOM ' and s2[13:15]=='CA':
                f2.write(s2)
            elif s2[0:7]=='HETATM ' and s2[13:15]=='CA':
                f2.write(s2)
            elif s2[0:5]=='MODEL' and s2[13]=='2':
                break    # stop at the second model
        pos=f1.tell()
    f1.close()
    print 'Exiting... main.py'

    # Step 2: keep only the x, y, z fields and write them to '<fin>_CA2'.
    print 'Executing main2_2.py'
    f3=open(fin+'_CA2','w+')
    pos=0
    f2.seek(pos,0)
    for i in f2.readlines():
        f2.seek(pos,0)
        s1=f2.readline(100)
        s=s1.split()
        print>>f3,s[6],s[7],s[8]
        pos=f2.tell()
    f2.close()
    print 'Exiting... main2.py'

    # Step 3: write the point pairs for each distance range (files A, B, C)
    # and the connected / unconnected index pairs (files E, F).
    print 'Executing main3.py and ad_mat.py...'
    f2=open('A','w')
    f0=open('B','w')
    f4=open('C','w')
    ff2=open('E','w')
    ff0=open('F','w')
    # Calculate the length of the protein
    f3.seek(0,0)
    tot=len(f3.readlines())
    print 'Length of the protein is:',tot
    # Record the file offset at which each coordinate line starts.
    pos=0
    lis=[]
    for i in range(1,tot+1):
        lis.append(pos)
        f3.seek(pos,0)
        st=f3.readline(50)
        pos=f3.tell()
    for i in range(tot):
        f3.seek(lis[i],0)
        ss1=f3.readline(50)
        z=i+1
        while z<tot:
            # NOTE: the source text is garbled between 'while z' and the first
            # distance test. The next five lines are reconstructed by analogy
            # with Program # 2 and are an assumption, not the original code.
            f3.seek(lis[z],0)
            ss2=f3.readline(50)
            ss=ss1.split()+ss2.split()
            sf=[float(ss[j]) for j in range(6)]
            dis=(sum([(sf[k]-sf[k+3])**2 for k in range(3)]))**(0.5)
            if dis>0.01 and dis<7.0:
                f2.write(ss1+ss2)
            if dis>9.0 and dis<11.0:
                f0.write(ss1+ss2)
            if dis>16.0 and dis<18.0:
                f4.write(ss1+ss2)
            if dis>0.01 and dis<7.0:
                print>>ff2,i+1,z+1
            else:
                print>>ff0,i+1,z+1
            z+=1
        # Add the backbone segment to the next residue to each of the three files.
        if i+1!=tot:
            f3.seek(lis[i+1],0)
            nxt=f3.readline(50)
            f2.write(ss1+nxt)
            f0.write(ss1+nxt)
            f4.write(ss1+nxt)
    f2.close()
    f0.close()
    f4.close()
    ff2.close()
    ff0.close()
    f3.close()
    print 'Exiting main3.py and ad_mat.py...'

    # Step 4: draw the five plots with gnuplot.
    print 'Plotting Graphs...'
    from Gnuplot import Gnuplot
    gp=Gnuplot()
    gp("set term jpeg")
    gp("set out '"+fin+"_1.jpg'")
    gp("splot '"+fin+"_CA2' w lp ls 7")
    gp("unset out")
    gp("set out '"+fin+"_2.jpg'")
    gp("plot 'E' w p, 'F' w d")
    gp("unset out")
    gp("set out '"+fin+"_3.jpg'")
    gp("splot 'A' w l, '"+fin+"_CA2' w lp ls 7")
    gp("unset out")
    gp("set out '"+fin+"_4.jpg'")
    gp("splot 'B' w l, '"+fin+"_CA2' w lp ls 7")
    gp("unset out")
    gp("set out '"+fin+"_5.jpg'")
    gp("splot 'C' w l, '"+fin+"_CA2' w lp ls 7")
    gp("unset out")
    print 'Exiting from the program...'

Program # 2: The following Python program writes the adjacency matrix to a file on the basis of the distance between the alpha carbons of each pair of amino acids. Two alpha carbons are considered to be connected if the distance between them is less than 7.0 Å, and the binary equivalent of this is taken to be 1. In other words, if the distance between the ith Cα and the jth Cα is less than 7.0 Å, the program puts 1 in the ith row and jth column of the adjacency matrix; otherwise it puts 0 there. The output file has the suffix ‘_Graph.csv’.

NOTE: This program takes as input the file named after the pdb file with the suffix ‘_CA2’, so make sure that it is present in the same directory as the program file.

    # Generates the output file for plotting the adjacency or the connectivity
    # matrix in csv format for Mathematica or MATLAB.
    for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:
        #fin=input('Enter the PDB filename: ')
        f3=open(fin+'_CA2','r')
        f2=open(fin+'_Graph.csv','w')
        tot=len(f3.readlines())
        f3.seek(0,0)
        pos1=0
        st=''
        for i in range(1,tot+1):
            # Coordinates of the ith alpha carbon.
            f3.seek(pos1,0)
            ss1=f3.readline(50)
            pos1=f3.tell()
            pos2=0
            for m in range(1,tot+1):
                # Coordinates of the mth alpha carbon.
                f3.seek(pos2,0)
                sf=[0]*6
                ss2=f3.readline(50)
                pos2=f3.tell()
                ss=ss1.split()+ss2.split()
                for j in range(6):
                    sf[j]=float(ss[j])
                dis=(sum([(sf[k]-sf[k+3])**2 for k in range(3)]))**(0.5)
                # dis>0.01 excludes the diagonal (an atom paired with itself).
                if dis<7.0 and dis>0.01:
                    st=st+str('1,')
                else:
                    st=st+str('0,')
            st=st+'\n'
        # Write the whole matrix once, after all rows have been built.
        f2.write(st)
        f2.close()
        f3.close()

Program # 3: The following Python program writes the distance matrix to a file with the suffix ‘_Dist.csv’. The distance matrix stores the distance between the ith and the jth alpha carbon atoms in its ith row and jth column. This program also takes its input from the file with the suffix ‘_CA2’, so make sure it is there in the same directory.

    for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:
        #fin=input('Enter the PDB filename: ')
        f3=open(fin+'_CA2','r')
        f2=open(fin+'_Dist.csv','w')
        tot=len(f3.readlines())
        f3.seek(0,0)
        pos1=0
        st=''
        for i in range(1,tot+1):
            # Coordinates of the ith alpha carbon.
            f3.seek(pos1,0)
            ss1=f3.readline(50)
            pos1=f3.tell()
            pos2=0
            for m in range(1,tot+1):
                # Coordinates of the mth alpha carbon.
                f3.seek(pos2,0)
                sf=[0]*6
                ss2=f3.readline(50)
                pos2=f3.tell()
                ss=ss1.split()+ss2.split()
                for j in range(6):
                    sf[j]=float(ss[j])
                dis=(sum([(sf[k]-sf[k+3])**2 for k in range(3)]))**(0.5)
                #if dis<5.0 and dis>0.1:
                st=st+'%f,'%dis
                #else:
                #    st=st+str('0,')
            st=st+'\n'
        # Write the whole matrix once, after all rows have been built.
        f2.write(st)
        f2.close()
        f3.close()
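Programs # 2 and # 3 build each CSV row by string concatenation, which leaves a trailing comma on every line. As an alternative, a minimal Python 3 sketch using the standard `csv` module is shown below; the helper names `write_matrix` and `read_matrix` are hypothetical, and an in-memory buffer stands in for the ‘_Dist.csv’ file.

```python
import csv
import io


def write_matrix(rows, handle):
    """Write a matrix (list of rows) as comma-separated lines."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(rows)


def read_matrix(handle, cast=float):
    """Read a matrix back, converting every cell with `cast`."""
    return [[cast(cell) for cell in row] for row in csv.reader(handle) if row]


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_matrix([[0.0, 3.8], [3.8, 0.0]], buffer)
text = buffer.getvalue()
matrix = read_matrix(io.StringIO(text))
print(text, end="")
```

With no trailing comma there is no empty last field, so a reader does not have to discard the final element of each row before converting the cells to numbers.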

Program # 4: The following Python program plots the number of nodes that have degree ‘k’ (i.e. ‘k’ links) against ‘k’. The output file is named after the pdb file with the suffix ‘_deg2.jpg’. This program takes the file created by Program # 2 (which creates the adjacency matrix) as input.

    from Gnuplot import Gnuplot
    for fin in ['1E76','1IC9','1R5S','1UZ9','1GGA','1FCD','2ZW3','1ZQC','2I7U']:
        f1=open(fin+'_Graph.csv','r')
        f2=open(fin+'_CA2','r')
        f2.seek(0,0)
        leng=len(f2.readlines())
        f1.seek(0,0)
        tot=len(f1.readlines())
        pos=0
        kk=[]
        f3=open(fin+'_degree2.txt','w')
        for i in range(1,tot+1):
            f1.seek(pos,0)
            ss=f1.readline((2*leng)+10)
            pos=f1.tell()
            st=ss.split(',')
            st.pop()    # drop the piece after the trailing comma
            for k in range(len(st)):
                st[k]=int(st[k])
            kk.append(sum(st))    # degree of node i = sum of row i
        # One line per distinct degree, in increasing order, so that the
        # histogram steps are drawn once and from left to right.
        for i in sorted(set(kk)):
            print>>f3,i,kk.count(i)
        f1.close()
        f2.close()
        f3.close()
        hp=Gnuplot()
        hp("set te jpeg")
        hp("set out '"+fin+"_deg2.jpg'")
        hp("plot [][0:] '"+fin+"_degree2.txt' w histeps")
        hp("unset out")
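The counting step of the degree distribution can also be sketched without file seeking. The Python 3 example below is an illustration, not the report's code: the function names are hypothetical, and `FIG1_CSV` is the adjacency matrix of Figure # 1 written in the trailing-comma layout that Program # 2 produces.

```python
from collections import Counter

# Adjacency matrix of Figure # 1 in the '_Graph.csv' layout of Program # 2.
FIG1_CSV = (
    "0,1,0,1,0,0,\n"
    "1,0,0,0,1,0,\n"
    "0,0,0,1,1,1,\n"
    "1,0,1,0,0,0,\n"
    "0,1,1,0,0,0,\n"
    "0,0,1,0,0,0,\n"
)


def degrees_from_csv(text):
    """Row sums of a 0/1 matrix given as CSV text (trailing commas allowed)."""
    degrees = []
    for line in text.splitlines():
        cells = [c for c in line.split(",") if c.strip()]
        if cells:
            degrees.append(sum(int(c) for c in cells))
    return degrees


def degree_distribution(degrees):
    """Sorted list of (k, number of nodes with degree k)."""
    return sorted(Counter(degrees).items())


degrees = degrees_from_csv(FIG1_CSV)
distribution = degree_distribution(degrees)
average_degree = sum(degrees) / len(degrees)
for k, count in distribution:
    print(k, count)
```

`Counter` tallies each degree in a single pass, and sorting the (k, count) pairs gives one line per distinct degree in increasing order, which is the form a histogram plot expects.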

Program # 5: The following Python program plots the degree of a node against the node number. The output file is named after the pdb file with the suffix ‘_deg.jpg’. This program also takes the file created by Program # 2 (which creates the adjacency matrix) as input.

    from Gnuplot import Gnuplot
    for fin in ['1E76','1IC9','1R5S','1UZ9','1GGA','1FCD','2ZW3','1ZQC','2I7U']:
        f1=open(fin+'_Graph.csv','r')
        f2=open(fin+'_CA2','r')
        f2.seek(0,0)
        leng=len(f2.readlines())
        f1.seek(0,0)
        tot=len(f1.readlines())
        pos=0
        kk=[]
        f3=open(fin+'_degree.txt','w')
        for i in range(1,tot+1):
            f1.seek(pos,0)
            ss=f1.readline((2*leng)+10)
            pos=f1.tell()
            st=ss.split(',')
            st.pop()    # drop the piece after the trailing comma
            for k in range(len(st)):
                st[k]=int(st[k])
            print>>f3,i,sum(st)    # node number and its degree
            kk.append(sum(st))
        # Written as a comment line so that gnuplot ignores it.
        print>>f3,'# The average degree is:',(sum(kk)*1.0/leng)
        f1.close()
        f2.close()
        f3.close()
        hp=Gnuplot()
        hp("set te jpeg")
        hp("set out '"+fin+"_deg.jpg'")
        hp("plot [][0:] '"+fin+"_degree.txt' w histeps")
        hp("unset out")

Program # 6: The following Python program prints the length of each protein on the screen, using the information extracted from the pdb file.

    for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:
        f2=open(fin+'_CA2','r')
        f2.seek(0,0)
        # One line per alpha carbon, so the line count is the protein length.
        print fin,len(f2.readlines())
        f2.close()

RESULTS:

Following are a few plots constructed using GNUPLOT.

Backbone of a protein (2I7U.pdb)

Big dots correspond to ‘1’ and small dots correspond to ‘0’.

Adjacency Matrix for 2I7U.pdb (Distance < 7.0 Å)


Short-range Interaction Network (Distance < 7.0 Å)

Interaction Network (Distance between 9.0 – 11.0 Å)


Long-range Interaction Network (Distance 16.0 – 18.0 Å)

CONCLUSIONS:

  • Visualizing protein networks as graphs offers a way to approach the complex problem of protein folding.
  • The experimentally observed correlation between protein folding rates and the clustering coefficient C is a non-trivial justification of protein network modelling.
  • The techniques of topological networks appear to predict the properties of proteins to a coarse degree of approximation.
  • The strength of this approach is that it poses the problem as a simplified computational one, using standard results of linear algebra.

ACKNOWLEDGEMENT & REFERENCES:

We thank Dr. Somdatta Sinha of the Centre for Cellular & Molecular Biology, Hyderabad, for her great help and for the effort she took to travel to our Institute to deliver a few extraordinary lectures on systems biology and protein contact networks. These lectures guided us a lot. We also thank Dr. Jayasri Sarma and Dr. Tradip Ganguly for their encouragement and support throughout the duration of the course.


Our references were:
  • www.wikipedia.org
  • www.google.co.in
  • www.rcsb.org/pdb (for information about various proteins)
  • Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps, by Elena Nabieva, Mona Singh, Amit Agarwal and others (2005)
  • Deciphering the Protein Network of Caenorhabditis elegans in the Approach of Systems Biology, by Chung-Yen Lin and others
