RNASeq analysis pipline

00-Using the conda environment

conda create --name rnaseq
conda activate rnaseq
conda install -c bioconda trim-galore

Prepare test.fq.gz(PE) files for test, we sample top 20000 lines(5k reads) for furthor analysis


zcat Tcell001/01_Rawdata/CD8_A2_R1.fastq.gz | head -n 20000 > test/Test_sample2_R1_fq
zcat Tcell001/01_Rawdata/CD8_A2_R2.fastq.gz | head -n 20000 > test/Test_sample2_R2_fq

01-Construction of the reference genome

We can download the human or mouse genome and gtf files from Genecode,and you certainly have other ways, such as NCBI, etc. I prefer to download from this site.

hisat2 uses enhancer bwa algorithm.The input file must be genome.fa format

# hisat2 
nohup hisat2-build -p 8 GRCm38.p6.genome.fa hisat2/mm10 &

02-Quality control of raw sequencing reads


fastqc -t 10  -o /01_Rawdata/fastqc test1.fq.gz

03-Removal of adapters and low quality reads

RNAseq Workflow

Here I provide a hallworkflow using Linux.

gtf=/home/yulab/nino/Reference/RNAseq/mmu
core=8
# conda activate rnaseq
# Run the project dir
# pro_dir=/home/yulab/nino/RNAseq/Project/Tcell001
pro_dir=`pwd`
mkdir ${pro_dir}/02_trim_galore  ${pro_dir}/03_align  ${pro_dir}/04_featurecounts  ${pro_dir}/05_exp
#############################################################
echo "Now Extract sample_info.txt"
sample=$(awk 'NR>2{print line}{line=$0} END{print line}' ${pro_dir}/sample_info.txt | awk '{print $1}')
R1=`ls ${pro_dir}/01_Rawdata/*R1*`
R2=`ls ${pro_dir}/01_Rawdata/*R2*`
#############################################################
echo "Run fastqc"
begin_time=`date`
mkdir ${pro_dir}/01_Rawdata/fastqc
ls ${pro_dir}/01_Rawdata/*fastq.gz | while read ID
do echo ${ID}
        fastqc -t ${core}  -o ${pro_dir}/01_Rawdata/fastqc ${ID}
        sleep 5s
done
end_time=`date`
echo "Fastqc begin time:"
echo ${begin_time}
echo "Fastqc end time:"
echo ${end_time}

QA

  1. when constructing hisat2 index, warning: Encountered reference sequence with only gaps