Linux Quick Tip

The du (disk usage) command is a Unix/Linux command that estimates file space usage.

The command:

du -hsc *

lists the file and directory sizes of the current working directory.

  • h = show sizes in human readable format (1K, 1M, 1G, …)
  • s = summarize the totals for each argument
  • c = display a grand total

For example running du -hsc * in my current directory lists all the sam files, their sizes, and the total size of the directory.


I’m running a HISAT2 alignment, so when I checked a few minutes later, I can see the sam files are bigger and closer to being done.


This is also an easy way to check if a file is empty or how file sizes compare to each other.

Quick Python Tutorial For Begginers

When you do bioinformatics, sometimes you have to spend a lot of time just converting files from one format to another. Today, I had a .gct file (gene cluster text file format) that I needed to simplify down to a regular tab separated file format.

The problem: I had a row of over 4000 sample names that don’t indicate the replicate. So I needed to add the replicate names to each sample name.

I needed to take this:

“STK11_p.G56W   STK11_p.G56W   STK11_p.G56W   STK11_p.G56W   STK11_p.G56W STK11_p.G56W   STK11_p.G56W   STK11_p.G56W   SERPINB5_p.A7T   SERPINB5_p.A7T SERPINB5_p.A7T   SERPINB5_p.A7T   SERPINB5_p.A7T   SERPINB5_p.A7T SERPINB5_p.A7T   SERPINB5_p.A7T…”

And turn it into:

“STK11_p.G56W_rep1   STK11_p.G56W_rep2   STK11_p.G56W_rep3   STK11_p.G56W_rep4 STK11_p.G56W_rep5   STK11_p.G56W_rep6   STK11_p.G56W_rep7   STK11_p.G56W_rep8 SERPINB5_p.A7T_rep1   SERPINB5_p.A7T_rep2   SERPINB5_p.A7T_rep3 SERPINB5_p.A7T_rep4   SERPINB5_p.A7T_rep5   SERPINB5_p.A7T_rep6   SERPINB5_p.A7T_rep7   SERPINB5_p.A7T_rep8…”

Because I have over 4000 of them, I needed to create a python script that does this for me!


name_file = open("/Users/Alexis/Desktop/samplenames.txt", "r")
new_name_file = open("/Users/Alexis/Desktop/new_samplenames.txt", "w")

for line in name_file:
   sample_list = line.split("\t") #split the line by tabs
   dict = {}                      #create an empty dictionary 
   new_sample_list = []           #create empty list for renamed samples

   #initialize dictionary
   for sample in sample_list: 
      dict[sample] = 1            #because I want to start naming at rep1
   #add indexes
   for item in sample_list:
      this_item = str(item + '_rep' + str(dict[item]))
      dict[item] += 1
      new_sample_list.append(this_item)   #add the renamed sample to list

#join the list together by tabs and write to the output file 

Dictionaries are useful data structures in python. Dictionaries are sets of "key:value" pairs, where each key must be unique. So, when I set dict[sample] = 1 in the for loop it goes through each unique sample name (key) and sets it equal to 1 (value). This is handy for the problem here because it allows us to only deal with the unique sample names by ignoring the repeats of each sample.

In the next for loop, we use the dictionary which was initialized at 1 to create the new sample name = item_rep1.

dict[item] +=1 adds 1 to the current value in the dictionary, making it 2, which in the second round in the loop will then make item_rep2. When the loop reaches a new sample name, it starts back at 1.

I'm sure there are many ways to solve this problem, but this is how I did it. Comment if you have any alternative methods!