Sequence Assembly

Assembly is the process of taking trace files generated from the ABI 373 Sequencers, transferring the data to UNIX, and processing them to determine if they are novel yeast DNA sequences.

Not all sequences are guaranteed to be the desired yeast sequence. E. coli DNA could have been introduced during a poorly processed cosmid prep or, alternatively, cosmid vector without a yeast DNA insert could have been ligated into the M13 vector during library construction. Consequently, it is necessary to screen the DNA traces (using a screening program called GCG) against a variety of databases to determine the nature of the DNA sequences.

Furthermore, it is likely that not all the sequences can be determined automatically. Consequently, one aspect of the assembly process involves manually editing the sequences the software cannot read. This is termed hand-tedding.

Afterwards, the sequences that GCG and the user have determined are most likely novel yeast sequences will be assembled into a project database (using the XBAP program) containing all the previously processed novel yeast sequences from this particular project.

Moving Data From 373 (Macs connected to Sequencers) to Brad (Central Mac)

All the data that is generated from the ABI 373 machines is transferred automatically to one main Macintosh which provides easy access to all the gels that have been run. Currently, that central Macintosh is called Brad.

Retracking Gel Images

1-Open any of the results folders in the To be tracked folder on Transfer.

2-Double click on the gel image.

This will typically present you with a black gel image, because by default you have zoomed in on the bottom of the gel which has no signal.

3-Click on the "+" icon labeled zoom so that it changes to a "-" and click on the gel image.

4-Retrack the gel image as needed keeping in mind that the lane numbers on the image need to match the lane numbers on the sample sheet.

If the lanes are mislabeled, the data is still useful for assembly. However, if a trace from that gel has a problem area which needs to be resolved by resequencing, the trace will not correspond correctly to the sample sheet and the incorrect template will be chosen for resequencing.

5-Once you have finished retracking the gel image, select Generate New Sample File under the Sample menu. The two key instructions you need to give in the subsequent window are that you want to analyze modified lanes only (so you don't waste time re-analyzing unchanged lanes) and that you want to overwrite the existing files with the newly analyzed data.

6-Finally, you must select the files to which the new data can be written. Simply return to the To be tracked folder, open the results folder containing the gel image you are presently retracking, and double click on the first data file. This automatically launches Analysis.

Analysis of each gel will take anywhere from five to twenty minutes. You can walk away at this time and come back when it is done.

7-After Analysis has finished, close the gel image and the results folder in which it resides and move that folder into the Transfer to Unix folder.

Moving Data To Unix System

8-Once all the results folders in the To be tracked folder have been retracked and moved into the Transfer to Unix folder, you can open the Transfer to Unix application on Transfer.

9-Typically all the folders are moved together, but the option exists to move only selected folders, one at a time. Either way, Transfer must establish a connection with the Unix machine Cycle, and to do so you must enter your username and then your password. All the folders will be transferred one at a time without requiring further action on your part, so you can walk away until the whole process is finished.

Time required for the transfer is proportional to the number of folders to be transferred.

Processing the New Data

1-Log in on one of the UNIX machines. To view what has been transferred and is available to be processed, type cd /share/mac/abi . Subsequently, list the folders in that directory (by typing ls). Open the folder you copied the data into by typing cd <foldername>.

By typing "cd /share/mac/abi/<foldername>" you are accessing the files you copied from the Mac.

2-If all the trace files are from the same project number, then type cd /assembly/<projectnumber>/newdata. Subsequently, list the files. If there are files present that have been processed (i.e. have a .n, .i, or .info suffix), then you need to create a temporary directory to hold these files. Type mkdir temp. Subsequently, type movetraces <project#>* temp.

Avoid combining processed and unprocessed data in the newdata directory. Before moving unprocessed files into the newdata directory, check it for processed files. If any exist, they must be moved to a temporary directory. This is required because re-processing data will create file conflicts.
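The check described above can be sketched in a scratch directory. This is only an illustration under stated assumptions: the trace names are made up, the scratch directory stands in for /assembly/<project#>/newdata, and plain mv stands in for the site-specific movetraces script, whose exact behavior is not described here.

```shell
# Scratch directory standing in for a project's newdata directory (assumption).
newdata=$(mktemp -d)
cd "$newdata" || exit 1

# Hypothetical leftovers from an earlier run, plus one unprocessed trace.
touch 1234a01.n 1234a02.i 1234a03.info 1234b01

# Move anything with a processed suffix (.n, .i, .info) into temp.
mkdir -p temp
for f in *.n *.i *.info; do
    # An unmatched pattern stays literal, so skip names that do not exist.
    [ -e "$f" ] || continue
    mv "$f" temp/
done

ls temp    # the processed files, now out of the way
ls         # what is left in newdata
```

Looping over the three suffixes explicitly keeps unprocessed traces in place even when they share the same project prefix.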

3-Return to the /share/mac/abi directory (see beginning of section). Type

mv_<project#>*_/assembly/<project#>/newdata .

The files you just moved across from Brad do not necessarily all belong to the same project. This step will distribute each file to the newdata directory of the appropriate project.
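When a transferred folder holds traces from several projects, the move is repeated once per project number. A minimal sketch, assuming that trace names begin with the project number (as the <project#>* pattern implies) and that the underscores in the printed command stand for spaces. The scratch directories replace /share/mac/abi and /assembly, and the project numbers 1234 and 5678 are made up.

```shell
# Scratch stand-ins for /share/mac/abi/<foldername> and /assembly (assumptions).
incoming=$(mktemp -d)
assembly=$(mktemp -d)
touch "$incoming/1234a01" "$incoming/1234a02" "$incoming/5678c01"

# Hypothetical project numbers; substitute the real ones.
for project in 1234 5678; do
    mkdir -p "$assembly/$project/newdata"
    # Same shape as: mv <project#>* /assembly/<project#>/newdata
    mv "$incoming/$project"* "$assembly/$project/newdata/"
done

ls "$assembly/1234/newdata" "$assembly/5678/newdata"
```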

4-Change to the newdata directory of the project you are manipulating (type cd /assembly/<project#>/newdata). Type chmod_-x_*.

The newly moved files are considered "executable" by Unix. Further manipulations of the files will not work properly if this is not changed. Typing "chmod_-x_*" will change executable files into non-executable files. If you do not have write permission, then you must get the person who moved the data to change the execution mode.
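The effect of the chmod command can be checked in a scratch directory. The sketch below reads the underscores in the printed command as spaces (an assumption) and uses made-up file names.

```shell
# Scratch directory with two hypothetical trace files marked executable,
# as files arriving from the Mac are said to be.
newdata=$(mktemp -d)
cd "$newdata" || exit 1
touch 1234a01 1234a02
chmod +x 1234a01 1234a02

# Clear the execute bit on every file in the directory.
chmod -x *

# Report whether anything is still executable.
for f in *; do
    if [ -x "$f" ]; then
        echo "still executable: $f"
    else
        echo "ok: $f"
    fi
done
```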

5-Type process-newdata_-vector_cosmid .

This command will initiate the GCG software, which will screen all files in the newdata directory against E. coli, cosmid, and vector sequences. In the end, the program will label these files according to source DNA (desired yeast DNA vs. contaminant DNA) and according to the quality of the trace (usable vs. questionable vs. reject). The labels appear as suffixes appended to the trace file names.

Post-Processing the New Data

1-Once the new trace files have been processed, type ls. The display should list all the traces with a variety of suffixes. Type the following commands: movetraces_*.x_../rejects, movetraces_*.v*_../M13, movetraces_*.o*_../cosmid, and movetraces_*.e*_../ecoli.

The output from GCG has multiple suffixes. The .x files are the traces that the computer simply could not call. The .v files are those traces that GCG found matched the vector sequence, the .o files are those that matched cosmid sequences, and the .e files are those which were determined to be E. coli. These files serve no further purpose, which is why they are moved to auxiliary directories.
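The suffix-to-directory mapping above can be sketched as a single loop. This is an illustration only: the trace names are invented, a scratch directory replaces the project directory, and plain mv stands in for the site-specific movetraces script.

```shell
# Scratch layout standing in for a project directory (assumption).
project=$(mktemp -d)
mkdir -p "$project/newdata" "$project/rejects" "$project/M13" \
         "$project/cosmid" "$project/ecoli"
cd "$project/newdata" || exit 1

# Hypothetical processed traces, one per outcome.
touch 1234a01.x 1234a02.v 1234a03.o 1234a04.e 1234a05.s 1234a06.n

for f in *; do
    case "$f" in
        *.x)  dest=../rejects ;;   # could not be called
        *.v*) dest=../M13 ;;       # matched vector
        *.o*) dest=../cosmid ;;    # matched cosmid
        *.e*) dest=../ecoli ;;     # matched E. coli
        *)    continue ;;          # .s, .n and .i files stay for later steps
    esac
    mv "$f" "$dest/"
done

ls    # only the .s and .n traces remain
```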

2-Type ls_-1_<project#>.s*_>_toasmb . Then type movetraces_*.s*_../data and mv_toasmb_../data.

In order to progress further, it is necessary to list all the .s files, one name per line, in a text file called toasmb. The .s files are those traces that the computer determined were novel yeast sequences. All the .s files and the toasmb file are then moved into the data directory where all the other yeast sequences are located.
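Building the list file and moving it along with the traces might look like the following in a scratch directory. The names are hypothetical, mv stands in for movetraces, and the pattern *.s* is used here in place of the project-specific pattern.

```shell
# Scratch stand-ins for the newdata and data directories (assumption).
data_root=$(mktemp -d)
mkdir -p "$data_root/newdata" "$data_root/data"
cd "$data_root/newdata" || exit 1
touch 1234a01.s 1234a02.s 1234a03.n

# -1 lists one name per line, which is the layout the list file needs.
ls -1 *.s* > toasmb
cat toasmb

# Move the accepted traces and their list into the data directory.
mv *.s* ../data/
mv toasmb ../data/
```

The .n trace is left behind in newdata, where it waits for manual editing.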

Ted-Helping

At this point, the traces remaining in the newdata directory are those whose nature the GCG software could not determine, and they should only have .n or .i suffixes. These files cannot be processed automatically because the .n files usually have too many unidentifiable bases, while the .i files are missing the yeast insert site TGACTCA. Consequently, it is necessary to manually determine the nature of these traces. The process of manually calling traces is termed "hand-tedding" and uses the "Ted-helper" program.

There are two main aspects of hand-tedding. The first involves determining if the trace is usable. The second involves setting the limits of the readable region: in general, the beginning and end of a trace file are poor and thus unusable, and the computer automatically attempts to mark and "hide" these two ends. If the trace is found usable, it becomes necessary to adjust the computer-called left and right limits.

1-Type ted-helper. This starts the program used to manually edit questionable traces. A window with a list of traces should appear. Choose a trace to edit by double clicking on the trace filename.

2-Examine the trace by carefully scrolling through the trace file. Subsequently, determine if it is usable. (Usable traces are characterized by clearly and uniquely distinguishable sharp peaks.) If unusable, click on the output box. The name of the file will appear followed by .s, the default suffix. Change the suffix to .x. Click the okay button to save and quit the editor. Continue with the next trace from the Ted-helper list.

Poor traces result from errors anywhere along the production line: picking two plaques into the same well, preparing poor template preps, failure on the Catalyst, etc.

3-If the trace is determined usable, then you must modify the left limit of readability. First click the adjust left cut box. Next look for the insert sequence TGACTC. If found, move the cursor to the right of the insert site and click the mouse. If this sequence cannot be found, then determine the point at which the sequence can be reliably called and click the mouse at that point.

4-Scroll through the trace looking for computer errors or miscalls. If such mistakes exist, click the edit sequence function. Manually change the sequence so that it fits the trace data. If you change base pairs, use a lowercase character so that it is possible to differentiate between computer-called sequence and human-called sequence.

Errors in basecalling can occur often. Sometimes the computer simply inserts a base pair where one obviously does not exist. Other times, a whole series of extraneous peaks will appear whose amplitude is great enough to cause the computer to miscall the sequence. In situations like these, it is usually possible to see the true sequence below the erroneous peaks.

5-Determine the right limit of readability. Good indicators are peaks that become less sharp or decrease in amplitude, or the computer starting to call the same base twice. When the sequence becomes unreliable, click adjust right cut and click the mouse at that point. Then click the output box. The trace should appear with a .s suffix at the end. Simply click okay and then quit the editor.

6-Select the next trace to be edited. Continue doing this until all the .n and .i files have been hand-tedded.

7-When all the hand-tedding is done, you'll want to rescreen the questionable sequences against the cosmid, vector, and known E. coli sequences. First, type rm_-rf_*.n , rm_-rf_*.i , and rm_*.1_*.2_*.3 . Only use these commands if you have edited all the trace files in the current directory.

The computer may not necessarily delete the .n and .i files when you edit the traces. The computer also generates some unwanted .1, .2, and .3 files. However, if you have finished hand-tedding, then all the traces will now have either a .x or a .s suffix. The first two commands eliminate the .n and .i files, and the third removes the .1, .2, and .3 files.
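Because the deletion is only safe once every trace has been edited, a guarded variant can be useful. The sketch below assumes that an edited trace is saved next to the original under the same base name with a .s or .x suffix, which the text suggests but does not state outright. The file names are hypothetical.

```shell
# Scratch directory standing in for newdata (assumption).
newdata=$(mktemp -d)
cd "$newdata" || exit 1

# Hypothetical state after hand-tedding: two edited traces, one forgotten.
touch 1234a01.n 1234a01.s 1234a02.i 1234a02.x 1234a03.n

for f in *.n *.i; do
    [ -e "$f" ] || continue           # skip unmatched patterns
    base=${f%.*}                      # strip the .n or .i suffix
    if [ -e "$base.s" ] || [ -e "$base.x" ]; then
        rm -f "$f"                    # an edited copy exists, safe to delete
    else
        echo "not yet edited: $f"     # keep it and report
    fi
done
```

Plain rm -f is enough here because the targets are regular files; the -r flag only matters for directories.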

8-Type movetraces_*.x_../rejects .

9-Type ls_-1_*.s_>_toassemble. Then type movetraces_*.s_../data and mv_toassemble_../data .

This series of commands will list all the files that you have determined are readable through hand-tedding in another file called toassemble. The .s files and the toassemble text file are then moved into the data directory where all the other yeast sequences reside.

Auto-Assembling with XBAP

This next series of commands will take the new traces (hereafter termed contigs) and assemble them with the other contigs that have already been sequenced. At this point, you want to be sure that you are working from an X terminal hooked directly to a computer in the lab (i.e. cycle or vegemite). If you use the network, the process will be very slow.

1-Move to the data directory of the project (type cd_../data).

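2-If you have moved both a toasmb and a toassemble file into the data directory, then type cat_toassemble_>>_toasmb. Do not redirect cat_toasmb_toassemble into toasmb with a single >, because the shell empties toasmb before cat can read it.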

This command will combine these two text files into the toasmb file. As a simple check, type more_toasmb after the cat command has been executed. This shows the contents of toasmb; you should be able to see all the contents that were originally in either toasmb or toassemble.
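Combining the two lists can be tried out in a scratch directory first. The sketch appends toassemble to toasmb. In a POSIX shell, redirecting with a single > into a file that is also one of cat's inputs truncates that file before it is read, so its original entries would be lost. The list contents are made up.

```shell
# Scratch directory standing in for the data directory (assumption).
data=$(mktemp -d)
cd "$data" || exit 1
printf '1234a01.s\n1234a02.s\n' > toasmb    # hypothetical list from processing
printf '1234a03.s\n' > toassemble           # hypothetical list from hand-tedding

# Append rather than overwrite, so the existing entries in toasmb survive.
cat toassemble >> toasmb

wc -l < toasmb    # line count of the combined list
cat toasmb        # same check as "more toasmb"
```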

3-Type xbap to load the assembling program.

4-Select "yes" at the prompt asking whether or not to open an existing database. Type A<project#> when asked which database to open.

This series of commands assumes that this is not the first set of data to be entered from a particular project. If this is the first set of data, then you will create a new database calling it A<project#> rather than opening up an existing database.

5-Select the auto-assemble function under the Modification heading.

This is the function which will take the list of traces, assemble the sequences, and attempt to join the sequences that overlap into the existing database.

6-Answer "yes" to the prompt "permit entry?"

7-Choose any one of the four options to the next prompt.

8-At the prompt "use File of filenames? ", choose "yes."

This inputs the list of files to be assembled.

9-At the "File of gel reading name?" prompt, type "toasmb" as the file with the list. Click on "OK".

10-The computer will ask for the name of a file in which to list the traces that do not meet the selection criteria. Type "rejects". If the computer states that the file already exists, choose to overwrite the existing file.

The computer will keep a list of the traces that do not meet the standards. It does not matter that the existing file is overwritten, because the rejected traces will be retrieved later in the assembly process.

11-Out of the given options, select "Perform normal shotgun assembly." Click on "OK".

12-At the "Permit joins?" prompt, answer "yes."

13-The computer will then request a series of numerical criteria which will determine the stringency with which the traces can be placed into the database. Type 20, 25, 25, and 10% for the criteria.

By typing toasmb in step 9, you have designated toasmb as the file which lists all the traces to be assembled. The first criterion designates the number of base pairs that must match in order to commence analysis. The second value is the number of permissible dashes in a region before analysis can commence, and the third value is the number of permissible dashes per gel in the contig. The fourth value is the percentage of mismatches that will be tolerated before the trace is rejected.
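To get a feel for the fourth criterion, the arithmetic of a 10% cutoff can be worked through by hand or in the shell. The numbers below are invented, and how XBAP itself counts mismatches is not described here, so this only illustrates the threshold and not the program's internals.

```shell
# Hypothetical alignment: 400 aligned bases, 36 of them mismatched.
aligned=400
mismatches=36
max_percent=10    # the fourth criterion typed in step 13

# Integer comparison without division:
# mismatches / aligned <= max_percent / 100
if [ $((mismatches * 100)) -le $((aligned * max_percent)) ]; then
    verdict=accepted
else
    verdict=rejected
fi
echo "$mismatches of $aligned mismatched: $verdict"
```

With these numbers the mismatch rate is 9%, which is within the 10% limit.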

14-The computer will ask what scan size should be used. Choose the maximum value listed. The computer will then begin working.

At the end, the output box will list how many of the traces were accepted into the new database, how many were actually joined to other contigs, and how many were input as other contigs.


Last Updated December 12, 1996