Extracting exonic and intronic regions from a bam file

 

Step 1: Making separate bed files for the intronic & exonic regions

screen-shot-2017-02-05-at-8-50-56-pm

  • Then a second page will appear where you  select “Introns plus” and leave as 0 to just get the introns
  • select “get BED”
  • then repeat, renaming the file to UCSC_exons_hg19.tsv and on the second page selecting “Exons plus”
  • you can use the UCSC genome browser to confirm the files have only introns (or only exons)

Step 2: Extracting the specific regions from your bam file

This samtools command will only output alignments overlapping the input BED file.

The -h option = output file will include the header. The -L option = output alignments overlapping the input BED file.

Only output alignments overlapping introns:

samtools view -h -L UCSC_Introns_hg19_.bed  sample.bam > sample_introns.bam

Only output alignments overlapping exons:

samtools view -h -L UCSC_exons_hg19_new.bed sample.bam > sample_exons.bam

Linux Quick Tip

The du (disk usage) command is a Unix/Linux command that estimates file space usage.

The command:

du -hsc *

lists the file and directory sizes of the current working directory.

  • h = show sizes in human readable format (1K, 1M, 1G, …)
  • s = summarize the totals for each argument
  • c = display a grand total

For example running du -hsc * in my current directory lists all the sam files, their sizes, and the total size of the directory.

screen-shot-2016-12-19-at-4-30-45-pm

I’m running a HISAT2 alignment, so when I checked a few minutes later, I can see the sam files are bigger and closer to being done.

screen-shot-2016-12-19-at-4-32-06-pm

This is also an easy way to check if a file is empty or how file sizes compare to each other.

Quick Python Tutorial For Begginers

When you do bioinformatics, sometimes you have to spend a lot of time just converting files from one format to another. Today, I had a .gct file (gene cluster text file format) that I needed to simplify down to a regular tab separated file format.

The problem: I had a row of over 4000 sample names that don’t indicate the replicate. So I needed to add the replicate names to each sample name.

I needed to take this:

“STK11_p.G56W   STK11_p.G56W   STK11_p.G56W   STK11_p.G56W   STK11_p.G56W STK11_p.G56W   STK11_p.G56W   STK11_p.G56W   SERPINB5_p.A7T   SERPINB5_p.A7T SERPINB5_p.A7T   SERPINB5_p.A7T   SERPINB5_p.A7T   SERPINB5_p.A7T SERPINB5_p.A7T   SERPINB5_p.A7T…”

And turn it into:

“STK11_p.G56W_rep1   STK11_p.G56W_rep2   STK11_p.G56W_rep3   STK11_p.G56W_rep4 STK11_p.G56W_rep5   STK11_p.G56W_rep6   STK11_p.G56W_rep7   STK11_p.G56W_rep8 SERPINB5_p.A7T_rep1   SERPINB5_p.A7T_rep2   SERPINB5_p.A7T_rep3 SERPINB5_p.A7T_rep4   SERPINB5_p.A7T_rep5   SERPINB5_p.A7T_rep6   SERPINB5_p.A7T_rep7   SERPINB5_p.A7T_rep8…”

Because I have over 4000 of them, I needed to create a python script that does this for me!

 


name_file = open("/Users/Alexis/Desktop/samplenames.txt", "r")
new_name_file = open("/Users/Alexis/Desktop/new_samplenames.txt", "w")

for line in name_file:
   sample_list = line.split("\t") #split the line by tabs
   dict = {}                      #create an empty dictionary 
   new_sample_list = []           #create empty list for renamed samples

   #initialize dictionary
   for sample in sample_list: 
      dict[sample] = 1            #because I want to start naming at rep1
   
   #add indexes
   for item in sample_list:
      this_item = str(item + '_rep' + str(dict[item]))
      dict[item] += 1
      new_sample_list.append(this_item)   #add the renamed sample to list

#join the list together by tabs and write to the output file 
new_name_file.write("\t".join(new_sample_list))

Dictionaries are useful data structures in python. Dictionaries are sets of "key:value" pairs, where each key must be unique. So, when I set dict[sample] = 1 in the for loop it goes through each unique sample name (key) and sets it equal to 1 (value). This is handy for the problem here because it allows us to only deal with the unique sample names by ignoring the repeats of each sample.

In the next for loop, we use the dictionary which was initialized at 1 to create the new sample name = item_rep1.

dict[item] +=1 adds 1 to the current value in the dictionary, making it 2, which in the second round in the loop will then make item_rep2. When the loop reaches a new sample name, it starts back at 1.

I'm sure there are many ways to solve this problem, but this is how I did it. Comment if you have any alternative methods!

First blog post

After reading this blog post about the benefits of science blogging, I figured I’d give it a shot. I know I can benefit from the opportunity to sort my ideas and learn by teaching. I plan to do some beginner bioinformatics tutorials and document my journey towards a PhD!