Bioinformatics & Homology Modeling
The purpose of today's exercise is to build the structure of a protein, beginning with only the DNA sequence of the gene encoding the protein. WHOA, you say: didn't Dr. Koudelka say that we could not predict structure from sequence? Indeed you cannot. However, you can build a reasonable model of a protein IF there is a structural homologue in the protein database.
The example we will be using in this exercise is dicC. DicC is a division
control protein of E. coli. We already know from a number
of studies that this protein is a DNA binding protein that recognizes specific
DNA sequences using a helix-turn-helix supersecondary structure element.
For the purposes of today's exercise, we will pretend that we did not know
what gene we had in our hot little hands.
PART I- PRELIMINARY ANALYSIS OF DNA & PROTEIN SEQUENCES
PART II BUILDING A MODEL PROTEIN
Before we begin, you will need to copy several files for today's exercises.
Bring one of the blue terminal windows forward
Copy the files you will be working with
At the unix '>' prompt type
mkdir homology (this command creates the directory homology)
then type
cp /nsm/home/koudelka/homology/*<space>homology (note there
is a space AFTER cp)
then type
cd homology
If you now type ls, followed by the enter key, there should be 8 files displayed:
dicCprot.seq dic_C.seq homology99_1.psv homology99_2.psv homology99_3.psv homology99_4.psv unknown.txt p22.pdb
PART I-PRELIMINARY ANALYSIS OF DNA & PROTEIN SEQUENCES
A. TECHNIQUES USED TO IDENTIFY GENES-DATABASE SEARCHING
Through various genetic manipulations, you have identified and cloned your candidate gene. The output from your Sequencing Facility was poor and you only have 50 bases of readable DNA sequence. That sequence is stored in a file called unknown.txt. Your first task is to search the sequence database at the National Center for Biotechnology Information (NCBI).
1) Start your web browser by typing 'eng netscape'
blastp compares an amino acid query sequence against a protein sequence database
blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx program cannot be used with the nr database on the BLAST Web page.
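To make "translated in all reading frames" concrete, here is a minimal Python sketch of a six-frame translation. It assumes the standard genetic code and is not the BLAST implementation; the function names are illustrative. The test fragment is the 19-base stretch that appears in the BLAST alignment later in this handout.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (upper case)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; unknown codons become X."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate three frames on each strand; keys are +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ggtattcgtttggcttcgc")
for name, peptide in frames.items():
    print(name, peptide)  # frame +1 reads GIRLAS
```

Frame +1 of this fragment reads GIRLAS, a stretch that also occurs in the dicC translation quoted in the database record, which is the kind of match blastx looks for.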
Sequences producing significant alignments:                        Score (bits)  E Value
gi|1787841|gb|AE000253.1|AE000253 Escherichia coli K12 MG16... 157 8e-37
gi|1742585|dbj|D90800.1|D90800 E.coli genomic DNA, Kohara c... 157 8e-37
gi|1742560|dbj|D90799.1|D90799 E.coli genomic DNA, Kohara c... 157 8e-37
gi|312764|emb|X07465.1|ECDICABC E. coli genes dicA, dicB, d... 157 8e-37
gi|5705995|gb|AC000049.11|AC000049 Homo sapiens Chromosome ... 38 0.53
gi|6598396|gb|AC003680.2|AC003680 Arabidopsis thaliana chro... 36 2.1
The first set of blue highlighted text indicates the database where
the sequence is found and the "accession number" that represents the file
name of the sequence in the database. Following that is a brief description
of the sequence file. The next value is the Score (in bits), a normalized
measure of the quality of the alignment between your probe sequence and a
portion of the sequence in the file. The larger this number is, the better
the match. The last value on this line is the "E value", the number of
alignments with a score at least this high that you would expect to find
by chance in a database of this size. The smaller this number is, the more
likely that your sequence and the sequence in the database represent the
same gene.
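The relationship between the bit score and the E value can be sketched with the usual Karlin-Altschul relation, E = m x n x 2^(-S), where S is the bit score, m the query length and n the database length. The lengths used below are made-up illustrative numbers, not the actual search-space sizes BLAST used for the table above, so the results will not reproduce the listed E values.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score.

    Uses E = m * n * 2**(-S). The lengths are effective search-space sizes;
    the values passed in below are hypothetical.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search space: a 50-base query against 1e9 database bases.
strong_hit = expect_value(157, 50, 10**9)
weak_hit = expect_value(38, 50, 10**9)
print(strong_hit, weak_hit)
```

Each additional bit of score halves the E value, which is why the 157-bit E. coli hits are overwhelmingly more convincing than the 38-bit and 36-bit hits at the bottom of the list.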
Query: 61 ggtattcgtttggcttcgc 79
|||||||||||||||||||
Sbjct: 209 ggtattcgtttggcttcgc 191
/note="dicC polypeptide"
/codon_start=1
/transl_table=11
/db_xref="PID:g41277"
/db_xref="SWISS-PROT:P06965"
/translation="MLKTDALLYFGSKTKLAQAAGIRLASLYSWKGDLVPEGRAMRLQ
EASGGELQYDPKVYDEYRKTKRAGRLNNENHS"
indicates that our probe or query sequence is found in the region of the database sequence that encodes the dicC polypeptide. However, the sequence in the database runs in the complementary direction (i.e., the reverse complement of the sequence in the database is the "coding sequence"). Note that this was suggested by the results of the BLAST query, where the Sbjct coordinates run from 209 down to 191.
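As a small illustration of the strand logic, the sketch below infers the strand of a hit from its subject coordinates and computes what the matching stretch looks like when the database entry is read in its stored direction. The function names are illustrative, not part of any BLAST tool.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(dna):
    """Return the reverse complement, preserving upper or lower case."""
    return dna.translate(COMPLEMENT)[::-1]


def hit_strand(subject_start, subject_end):
    """Descending subject coordinates mean the match is on the minus strand."""
    return "plus" if subject_start <= subject_end else "minus"


query = "ggtattcgtttggcttcgc"        # Query 61..79 in the alignment
strand = hit_strand(209, 191)        # Sbjct runs 209 -> 191
stored_segment = reverse_complement(query)

print(strand)
print(stored_segment)  # bases 191..209 as stored in the database entry
```

Taking the reverse complement twice returns the original query, which is a quick sanity check when you are unsure which strand you are looking at.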
To better visualize the orientation of the coding sequence, we will reformat the display output of the sequence page. In the upper left corner of the display window, next to the 'Display' button, set the display option to Graphics. Click on the Display button. In the view that appears, the entire sequence is depicted as a multicolored bar at the top and below that a detailed view of a region of the sequence is given. Unfortunately, this view defaults to the middle of the sequence. Click near the N-terminal end of the multicolored bar at the top to view the detailed display of the region we are interested in. As can be seen, the coding sequence reads in the opposite direction.
We have now learned that we are working with a known gene encoding a protein. Reading some of the references in the annotation will lead you to information regarding the protein's structure & function, which we will come back to later. We will now turn to a more detailed analysis of the DNA & protein sequence.
B. TECHNIQUES FOR ANALYZING DNA AND PROTEIN SEQUENCES-GCG PROGRAMS.
In this part of the class, we will be using another software package owned by the CAMBI Computer Facility, the so-called Genetics Computer Group (GCG) software. This software package contains programs used for analysis of the primary sequences of both protein and DNA.
This software resides on the NSM server pinky.nsm.buffalo.edu. There are several preliminary steps that must be done in order for you to run this software successfully from the SGI machines.
In one of the "blue" terminal windows:
Enter your password (the one used on the SGI's).
setenv DISPLAY machineyouaresittingat:0.0 (replace machineyouaresittingat with the name of the machine you are sitting at)
The program now starts.
xwindows
Hit the ENTER key to accept the COLORWORKSTATION default.
A GCG graphics window now appears.
a. Mapping the sequence-Generating a restriction map
Type map dic_C.seq
Hit the ENTER key to default through the beginning and end of the sequence regions to be analyzed
Hit the ENTER key to default to accept all ENZYMES for the restriction map
The next subcommand asks you about protein translation frames:
What protein translations do you want:
a) frame 1 b) frame 2 c) frame 3
d) frame 4 e) frame 5 f) frame 6
t)hree forward frames s)ix frames o)pen frames only
n)o protein translation q)uit
Please select (capitalize for 3-letter) (* t *):
At the : prompt type a capital A and hit enter. This selects a three letter designation in coding frame 1.
Hit the ENTER key to accept the default filename dic_C.map.
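Conceptually, a restriction map is a search for each enzyme's recognition sequence along the DNA. The sketch below shows the idea for three common enzymes. It is not how GCG map is implemented, it reports only where each site starts (not the cut position within the site), and the test sequence is invented.

```python
import re

# Recognition sequences for three common enzymes (all palindromic, so
# scanning one strand is enough).
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(dna, enzymes=ENZYMES):
    """Return {enzyme: [1-based start positions of each recognition site]}."""
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        # A lookahead pattern reports overlapping occurrences as well.
        hits[name] = [m.start() + 1 for m in re.finditer(f"(?={site})", dna)]
    return hits


sites = find_sites("ccGAATTCaaGGATCCtt")  # invented test sequence
cutters = sorted(name for name, positions in sites.items() if positions)
non_cutters = sorted(name for name, positions in sites.items() if not positions)
print(sites)
print("cut:", cutters, "do not cut:", non_cutters)
```

Like the map output, this separates enzymes that cut the sequence from those that do not.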
b. Viewing the output.
Type more<space>dic_C.map
A portion of the sequence with a restriction map above and a protein translation below becomes visible. To look at the rest of the file, hit the space bar until you reach the end of the file and the unix '>' prompt appears again. Note that you also get a list of enzymes that do and do not cut your DNA sequence.
Alternatively you could type
cat<space>dic_C.map
and the entire file would appear. You can then use the window's slider bar to look at the file.
Type translate dic_C.seq
Hit the ENTER key to default through the beginning and end of the sequence regions to be analyzed
Hit the ENTER key to accept the beginning and end of the sequence regions to be analyzed
Type 'w' to write the translation to a file
Hit the ENTER key to accept the default file name dic_C.pep
Now we will analyze the peptide structure based on the physical properties of its constituent amino acids
Hit the ENTER key to default through the beginning and end of the sequence regions to be analyzed
Hit the ENTER key to accept the default hydrophilicity index of Kyte-Doolittle
Hit the ENTER key to accept the default output file name dic_C.p2s
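To see what a Kyte-Doolittle profile measures, the sketch below averages the published Kyte-Doolittle hydropathy values over a sliding window along the dicC translation quoted in the database record. The window length of 7 is an assumption chosen for illustration, not necessarily what GCG uses, and on this scale positive values are hydrophobic, so a hydrophilicity plot has the opposite sign.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# dicC translation as given in the database record quoted earlier.
DICC = ("MLKTDALLYFGSKTKLAQAAGIRLASLYSWKGDLVPEGRAMRLQ"
        "EASGGELQYDPKVYDEYRKTKRAGRLNNENHS")


def hydropathy_profile(protein, window=7):
    """Mean hydropathy of every window-length stretch, N- to C-terminus."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


profile = hydropathy_profile(DICC)
most_hydrophobic = max(range(len(profile)), key=profile.__getitem__)
print(f"most hydrophobic window starts at residue {most_hydrophobic + 1}")
```

Plotting the profile against residue number gives the kind of trace that the hydrophilicity panel of the GCG plot summarizes.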
Now we will display the results of this analysis in the xwindow we opened earlier.
c. Using plotstructure
Type plotstructure dic_C.seq
Hit the ENTER key to default through the beginning and end of the sequence regions to be analyzed
Choose a 1 dimensional panel graph plot by typing '1'
Press the <ENTER> key when prompted for a <RETURN>
A plot summarizing the hydrophilicity, surface probability, helicity, β-sheet probability, and several other features of the protein primary sequence appears in the GCG xwindow. Be sure to note where the helices and sheets are predicted to occur.
d. Exiting GCG
In the window connected to pinky, type logout
C. TECHNIQUES FOR CHOOSING STRUCTURAL HOMOLOGUES
The next step in this process is to use the sequence information, along with other information gathered along the way to try and build a model of the dicC protein based on its sequence and structural homologues.
One question you may have is how to decide what the structural homologues of the protein are. Although the computer can HELP, you must do a little literature-reading legwork on your own. One way the computer CAN help is to speed up the literature search. This can be done in BLAST.
Scroll back to the top of the file where the literature references are and find Reference 3
REFERENCE 3 (bases 1 to 4441)
AUTHORS Bejar,S., Bouche,F. and Bouche,J.P.
TITLE Cell division inhibition gene dicB is regulated by a locus similar
to lambdoid bacteriophage immunity loci
JOURNAL Mol. Gen. Genet. 212 (1), 11-19 (1988)
MEDLINE 88232418
PART II BUILDING A MODEL PROTEIN
Using the procedure described above, I have identified three structural homologues of the dicC protein. These are λ repressor, 434 repressor, and P22 repressor. I have downloaded these structures and placed them in a folder called homology99_1.psv. We will begin building our model protein by first examining the structures of these proteins.
1) File
Restore Folder- homology99_1.psv
The purpose of the rest of this exercise is to build a structure for dicC, which is homologous to the three we have displayed on the screen, but whose 3-D structure is unknown. This built protein will draw on known sequence and structural homologies in these HTH proteins.
a) Sequences-
c) Alignment-
Manual
Scoring Matrix → Mutation
Execute
Initialize-Click on Q117 of REPH2O and Q21 of P22R and then click on execute. The box now turns green.
Using the right mouse button, extend the box three amino acids towards the C-terminus.
Note the movement of the structure, and observe the RMS deviation and sequence homology values on the command line in the lower left part of the screen. A good structural alignment has a low RMS value; a good sequence alignment has a high homology score. Move the box until it encompasses the sequence of both proteins from Q117/Q21 through V124/V28.
We now need to compare the sequence and structure homologies between REPH2O and LAMREPCOR and also between P22R and LAMREPCOR. Before we can do this we have to 'freeze' the sequence box we have already created.
j) Boxes
DELETE ALL OBJECTS BEFORE RESTORING NEXT FOLDER
Restore Folder-homology99_2.psv
<objectname>$SCR- contains all the structural information in the three structurally conserved regions for a particular object.
<objectname>$SCR1-3 Each contains the structural information for one structurally conserved region for a particular object.
Molecule Pick Level Subset
Superimpose -Backbone
Source P22R$SCR
Target REPH2O$SCR
Execute
Molecule Pick Level Subset
Superimpose -Backbone
Source LAMREPCOR$SCR
Target REPH2O$SCR
The two molecules now superimpose on the screen. Note the RMS difference. Which pair of molecules are more alike?
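The RMS difference reported after a superposition is the root-mean-square distance between matched atoms. The sketch below computes it for two coordinate lists, assuming the structures have already been superimposed and the atoms are listed in matching order. It does not perform the fitting itself, and the coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square distance between matched (x, y, z) coordinates.

    Assumes both structures are already superimposed and atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Invented backbone coordinates: the second set is shifted 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]
shifted = [(x, y, z + 1.0) for x, y, z in reference]
print(rmsd(reference, shifted))
```

The lower the value for a pair of structurally conserved regions, the more alike the two backbones are in that region.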
We are now finally ready to begin building the new protein's structure
File Fmt → Biosym
Execute
Another sequence line now appears at the bottom of the screen. It is, however, unaligned, both in sequence and proposed structure, with the other three proteins. Aligning it is the next thing we must do. Unfortunately, the alignment program in Insight is terrible at recognizing more distant relationships, so I am incorporating information from a GCG alignment I did on these protein sequences. You can also use published alignments at this step. To use this manual alignment strategy:
Alignment
Pairwise_Sequence
OFF
Execute
Now Click on SEQUENCE next to Mode in the Homology window.
Align the sequences by clicking USING THE MIDDLE MOUSE BUTTON and dragging the first residue of dicCprot so that it is aligned with Ile 21 of LAMREPCOR.
DO NOT DO THIS TODAY-BUT IF YOU ARE INTERESTED IN DOING AN AUTOMATIC ALIGNMENT FOLLOW THE PROCEDURE IN f)
Scoring Matrix-Mutation
Gap Penalty-15 (This is very important!!!)
Sequence 1 REPH2O
Sequence 2 dicCprot
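To see why the gap penalty matters so much in an automatic alignment, here is a minimal global (Needleman-Wunsch style) alignment score in Python. It uses a simple match/mismatch score in place of the Mutation matrix and a linear gap penalty, so it is only a sketch of the idea, not the algorithm or scoring the Homology module uses. The default values are illustrative assumptions.

```python
def global_alignment_score(seq1, seq2, match=5, mismatch=-4, gap_penalty=15):
    """Best global alignment score with a linear gap penalty.

    match/mismatch stand in for a real substitution matrix; gap_penalty
    is subtracted once for every gap position.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = -i * gap_penalty
    for j in range(1, cols):
        score[0][j] = -j * gap_penalty
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,      # align the two residues
                score[i - 1][j] - gap_penalty,   # gap in seq2
                score[i][j - 1] - gap_penalty,   # gap in seq1
            )
    return score[-1][-1]


# Invented peptides: the second lacks one residue of the first.
cheap_gaps = global_alignment_score("QAAGIRLA", "QAGIRLA", gap_penalty=2)
costly_gaps = global_alignment_score("QAAGIRLA", "QAGIRLA", gap_penalty=15)
print(cheap_gaps, costly_gaps)
```

A high gap penalty makes the program reluctant to open gaps, which helps keep short conserved stretches together when the sequences are only distantly related.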
h) Alignment
We assess sequence homology pairwise, comparing dicCprot with REPH2O, P22R, and LAMREPCOR individually in each of the three SCR regions. This takes some time, so you will only do it for one here.
IMPORTANT In the sequence window, next to the Mode command, click on "Box"
j) Boxes
Enclose the entire SCR region with the box and record the homology score.
Which pair has the highest homology score?
Normally, we would now delete the unwanted sequence boxes using Boxes-Delete
and then repeat the process, comparing the dicCprot sequence with the three
other proteins in the two remaining SCRs using a process identical to the
one described here. We would then delete the unwanted boxes and freeze the
three good ones. I have done that for you, and here are the homology scores:
| Protein | SCR1 | SCR2 | SCR3 |
| LAMREPCOR | -11 | -6.28 | -8 |
| REPH2O | 6.25 | 16.25 | -4 |
| P22R | -5 | 10 | 10 |
DELETE ALL OBJECTS BEFORE RESTORING THE NEXT FOLDER
PART III
Restore Folder-homology99_3.psv
Use the scroll slider on the right of the sequence window to bring the SCR regions into view
We will now use the three SCR regions to assign coordinates to our new protein. We are using the coordinates of REPH2O for SCRs 1 & 2 and those of P22R for SCR 3.
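The choice of template for each SCR follows directly from the homology score table: for each region, take the protein with the highest score. A short Python sketch of that selection, using the scores from the table (the function name is illustrative):

```python
# Homology scores of dicCprot against each template, from the table above.
HOMOLOGY_SCORES = {
    "LAMREPCOR": {"SCR1": -11, "SCR2": -6.28, "SCR3": -8},
    "REPH2O": {"SCR1": 6.25, "SCR2": 16.25, "SCR3": -4},
    "P22R": {"SCR1": -5, "SCR2": 10, "SCR3": 10},
}


def best_template_per_region(scores):
    """For each region, return the protein with the highest homology score."""
    regions = next(iter(scores.values())).keys()
    return {
        region: max(scores, key=lambda protein: scores[protein][region])
        for region in regions
    }


templates = best_template_per_region(HOMOLOGY_SCORES)
print(templates)  # REPH2O for SCR1 and SCR2, P22R for SCR3
```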
a) Sequence
Simplify the display by displaying only the Cα trace of dicCprot using
Molecule
Molecular Pick Level → Molecule
Display-Only
Specified-Backbone
Execute.
Internal Overlap: 0.5
External Overlap: 0.5
Iterations: 1000
The program generates ten loops and creates them as subsets Choice$1-10. The program leaves you in Loops-Display. You now must evaluate the best loop choice for the region of interest. You can do this by inspection and RMS evaluation.
Zoom in on the region where the loop will be placed.
e) Loops
DELETE ALL OBJECTS BEFORE RESTORING THE NEXT FOLDER
PART IV
Restore Folder-homology99_4.psv
a) Refine
The molecule is still not done. Two things remain. First, the existing model of the rest of the protein contains several steric overlaps. You may have noticed messages to that effect throughout the exercise. Second, the ends of the molecule almost certainly do not exist as extended chains. They must be minimized into a more reasonable conformation. First, we will identify the bad contacts.
Overlap value- 0.5 (Å)
Option-intra
Monitor-off (IMPORTANT)
VERY IMPORTANT!!!- Click on Fix in each area of the control panel
d) Open module Discover_3
Click the Expert button ON
Select dicCprot as the Assembly/Molecule
Execute
Choose Specify → Non-bonds
Click Dielectric-Distance Dependent
Execute
Choose Calculate → Minimize
Set Run Max Steps 1000
Newton-Off
Execute
Choose D_Run → Run
select dicCprot0 as the job
Execute
h) Background_Job
Select Control_Background_Job
Under Control Mode select Detach from Job
Click on dicCprot0 in the job window
Execute
This allows the job to proceed more quickly because the program now no longer must continually update the display.
After you receive notification that the job has finished, we will look at the minimized structure:
This restores the output of the Discover run you set up above. Notice how the molecule has changed. Confirm that Discover has done its job of removing steric overlaps by re-running Bump.
Molecule-DicCprot0.cor (IMPORTANT)
Overlap value- 0.5 (Å)
Monitor-off
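A bump check can be pictured as a search for non-bonded atom pairs that sit closer together than their sizes allow. The sketch below assumes that "overlap" means the sum of the two van der Waals radii minus the interatomic distance, and flags pairs whose overlap exceeds the cutoff. This definition, the radii (Bondi values) and the coordinates are assumptions for illustration and may differ from what the Bump command does internally.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in angstroms (Bondi values).
VDW_RADII = {"C": 1.7, "N": 1.55, "O": 1.52, "H": 1.2, "S": 1.8}


def find_bumps(atoms, overlap_cutoff=0.5, bonded=frozenset()):
    """Return (i, j, overlap) for atom pairs that overlap by more than the cutoff.

    atoms is a list of (element, (x, y, z)); bonded is a set of (i, j)
    index pairs with i < j that are covalently bonded and should be skipped.
    """
    bumps = []
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(
        enumerate(atoms), 2
    ):
        if (i, j) in bonded:
            continue
        distance = math.dist(pos_i, pos_j)
        overlap = VDW_RADII[elem_i] + VDW_RADII[elem_j] - distance
        if overlap > overlap_cutoff:
            bumps.append((i, j, round(overlap, 2)))
    return bumps


# Invented coordinates: atoms 0 and 1 are far too close, atom 2 is far away.
atoms = [("C", (0.0, 0.0, 0.0)), ("C", (2.0, 0.0, 0.0)), ("O", (10.0, 0.0, 0.0))]
bumps = find_bumps(atoms)
print(bumps)
```

After a successful minimization, a check like this should return far fewer pairs than it did before the Discover run.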
You have now successfully created a model of the dicC protein. We will
be using this model later this semester when we analyze the mechanisms
of DNA recognition by proteins. Therefore, we will save this structure
for later use.