Basic GOR and NOR commands

You've installed GORpipe what next?
Now that you've installed GORpipe it's time to take a look at commonly used basic functions and commands of the GOR (and NOR) query language. In this tutorial we will work with publicly available data sets from the GTEx portal and NCBI. Variant based QTL data, patient level phenotypic data and Gene based annotation data to showcase the diversity of the GOR query language. The goal here is to start with basic data-wrangling in NOR context, then convert the data to GOR format to perform basic analysis in GOR context.
Save the following files locally
Here the file names have been changed to patient_pheno.tsv, transQTL_vars.tsv and gene_list.tsv for code clarity. Let's get started!
Basic data-wrangling with NOR
Start by reading through the first 100 lines of the phenotype file using nor along with the top command. Note the use of -h in front of the input file to treat the first line as header. If you include a # at the beginning of the first column the -h flag is not needed.
nor -h patient_pheno.tsv
| top 100
We see that the patient file contains the following four columns
SUBJID - Patient ID
SEX - Self-reported, 1=Male, 2=Female
AGE - Elapsed time since birth in years
DTHHRDY - Death classification based on the 4-point Hardy Scale, 0=Ventilator Case, 1=Violent and fast death, 2=Fast death of natural causes, 3=Intermediate death, 4=Slow death
Let's say we want to start by answering a couple of questions about the data set. For one the number of patients in each age category, here we can use the group command along with the -count flag
nor patient_pheno.tsv
| group -gc age -count
Note the use of -gc to specify the grouping column and the -sc in front of the subject id to indicate a string column. We also want to know how many patients belong to each of the five death classifications
nor patient_pheno.tsv
| group -gc DTHHRDY -count
Next we can calculate a column including the death classification description. For this action we use the calc command along with conditions specified in if-else statements using the if() function
nor patient_pheno.tsv
| calc dthhrdy_description if(dthhrdy='0','Ventilator Case',
if(dthhrdy='1','Violent and fast death',
if(dthhrdy='2','Fast death of natural causes',
if(dthhrdy='3','Intermediate death',
if(dthhrdy='4','Slow death',
'No description provided')))))
We can also select the columns of interest using the select command and filter the data set using the where command
nor patient_pheno.tsv
| select SUBJID,AGE,SEX
| where sex=1
Now lets have a look at the trans-QTL data set typing nor transQTL_vars.tsv
shows us that the data has 21 columns and consists of 29 rows of data. We are interested in combining the two data sets. Since there is no actual connection between them we will combine the data based on row number using the rownum command along with map
nor transQTL_vars.tsv
| rownum
| map -c rownum <(nor patient_pheno.tsv | rownum | columnsort rownum)
| select tissue_id,variant_id,maf,pval_nominal,pval_beta,pp4,age,dthhrdy,sex,subjid,rownum
Note the use of a nested query, the <(nor patient_pheno.tsv ...)
this is done to perform actions on the phenotype file before executing the mapping between the two data streams. Also note the use of columnsort compared to the select command at the end. Columnsort will re-order the columns specified without filtering out any columns. For combining two data sets, using map, it is necessary for the leftmost column of the right source (the phenotype file) to be the mapping column.
Note that the rownum column can be handy to easily retrieve a given row, or retrive the first or last row with certain conditions. For example adding | atmin rownum -gc tissue_id
would result in only the first row of each tissue type.
Additional basic joining methods include the multimap, merge and inset commands. The map command above will result in a comma separated value where the right source contains multiple rows/occurances for the mapping column, multimap on the other hand matches all occurrences resulting in multiple rows. The merge command will append one data source to another and inset will filter the left source based on the data in the right source.
Here is an example of using merge and a nested query that will result in all duplicated rows. If there is a mismatch between the number of columns of the left and right data source empty values are provided.
nor transQTL_vars.tsv | merge <(nor transQTL_vars.tsv)
Here is an example of using inset, adding a -b flag to the command will result in a boolean column called inSet to indicate if the filter column value is present in the right source.
nor transQTL_vars.tsv | skip 10 | top 20 | inset -c tissue_id <(nor transQTL_vars.tsv | top 20)
With the data combined lets look at different ways to aggregate the data. The group command can be used very effectively here as well. For example to retrieve the max row number per tissue type and all subject ids in a comma separated list
nor transQTL_vars.tsv
| rownum
| map -c rownum <(nor patient_pheno.tsv | rownum | columnsort rownum)
| select tissue_id,variant_id,maf,pval_nominal,pval_beta,pp4,age,dthhrdy,sex,subjid,rownum
| group -gc tissue_id -max -ic rownum -lis -sc subjid
| rename lis_(.*) #{1}
Notice the use of the rename command, removing the lis_ prefix automatically generated by the group command per individual flag. The granno command is another useful grouping and annotation command that performs aggregation on a given column without collapsing rows
nor transQTL_vars.tsv
| rownum
| map -c rownum <(nor patient_pheno.tsv | rownum | columnsort rownum)
| select tissue_id,variant_id,maf,pval_nominal,pval_beta,pp4,age,dthhrdy,sex,subjid,rownum
| granno -gc tissue_id -avg -ic maf
Here we are also aggregating on tissue_id but calculating the average allele frequency per tissue type. By adding something like| where avg_maf > 0.3
we can filter on the data.
Finally the pivot and unpivot commands can be used to aggregate and convert the data. A very simple unpivot example would be to unpivot on the tissue_id column, resulting in two columns col_name and col_value
nor transQTL_vars.tsv
| rownum
| map -c rownum <(nor patient_pheno.tsv | rownum | columnsort rownum)
| select tissue_id,variant_id,maf,pval_nominal,pval_beta,pp4,age,dthhrdy,sex,subjid,rownum
| unpivot 1-
| distinct
Adding the distinct command will return only unique rows. Here you could easily add some QC methods such as removing all values that are empty | where Col_Value != ''
or counting all different types of column values | group -gc col_name -sc col_value -count
. The pivot command is a little more complex as you have to provide the different types you want to map the row based data on over to horizontal format. Here we categorise the subjects in the data set as either case or control and pivot based on these two types
nor transQTL_vars.tsv
| rownum
| map -c rownum <(nor patient_pheno.tsv | rownum | columnsort rownum)
| select tissue_id,variant_id,maf,pval_nominal,pval_beta,pp4,age,dthhrdy,sex,subjid,rownum
| calc type if(sex=2 and age='60-69','case','ctrl')
| pivot type -v case,ctrl -gc subjid -e 'missing'
Finally before moving over to genomically ordered data here are a couple of useful string functions
nor transQTL_vars.tsv
| rownum
| map -c rownum <(nor patient_pheno.tsv | rownum | columnsort rownum)
| select tissue_id,variant_id,maf,pval_nominal,pval_beta,pp4,age,dthhrdy,sex,subjid,rownum
/* listsize returns the size the elements in a list based on a given separator */
| calc list_size listsize(variant_id, '_')
/* Replace an element in a list with another */
| replace tissue_id replace(tissue_id,'_',',')
/* Filter the list based on a given condition */
| where listhasany(tissue_id, 'Blood')
/* Replace based on if/else, retrieving the first element of a list if the condition is true */
| replace tissue_id if(maf < 0.3, listfirst(tissue_id), tissue_id)
/* combine multiple column values into one, using a given separator*/
| calc new_col cols2list('maf,rownum,list_size', '-')
/* combine multiple lists into one, based on a filter condition applied to each element in the list */
| calc new_list listzip(variant_id,listfilter(tissue_id, "x!='Whole'"))
Note how comments are represented within the GOR query language /* a GOR comment */.
Basic GOR queries
In order to ensure a valid GOR file there are two conditions that need to be met
- The data needs to contain as the first four columns: CHROM, POS, REF, ALT for variant based data and CHROM, Gene_start, Gene_stop for gene based data
- The data has to be genomically ordered by chromosome and position, this is achieved using the sort genome command.
Lets start by converting our combined data set to GOR format and write out the file. Since we want to be in GOR context we use gor instead of nor, note however that all the commands and functions covered above can also be used in GOR context - along with many additional ones!
We start by using colsplit to generate the required data columns from the variant_id column
create ##gor_format## = gor <(nor transQTL_vars.tsv
| rownum
| map -c rownum <(nor patient_pheno.tsv | rownum | columnsort rownum)
| select tissue_id,variant_id,maf,pval_nominal,pval_beta,pp4,age,dthhrdy,sex,subjid,rownum
| colsplit variant_id 4 col -s '_'
| columnsort col_*
| rename col_1 CHROM
| rename col_2 POS
| rename col_3 REF
| rename col_4 ALT
| hide variant_id)
| sort genome;
gor [##gor_format##] | write comb_data.gorz
Notice the use of the create statement to create an intermediate data table, referenced below in square brackets. The hide command is then used to remove the variant_id column from the data set as it is no longer needed. Here we again use the nested query function, to first perform some actions in nor context, then applying the sort genome command in GOR context after the nested query. To fully showcase the power of GOR we generate a couple of additional files and write them out and sore on .gorz, compressed GOR format, you can see the data below in the appendix.
Now that we have data on GOR format we can quickly seek the file based on chromosome, variant or a given range using the -p flag
gor -p chr9:97776641 comb_data.gorz
We can also retrieve a list of individuals with a given variant with an easy "one-liner" query
gor comb_data.gorz
| group 1 -gc ref,alt -lis -sc subjid
Same goes for retrieving the maximum allele frequency per 1000 base pairs
gor comb_data.gorz
| atmax 1000 maf
The most common way to combine two GOR files is using the varjoin command, when both files are variant based. For example we can quickly annotate the combined results with VEP annotations
gor comb_data.gorz
| varjoin vep_annotations.gorz
Adding -l -r after the varjoin will perform left-join overlap, showing all rows in the left source and reduce the output columns from the right source. This query takes less than a second and annotates the data with four new columns.
We can also annotate the variant file with a gene based file using the join command along with a -snpseg flag
gor comb_data.gorz
| join -snpseg -rprefix genes genes.gorz
Notice the use of the -rprefix flag, this will add a given prefix to all columns from the right source. We can also combine all three data sets
gor comb_data.gorz
| varjoin -l -r -rprefix VEP vep_annotations.gorz
| join -snpseg -xl vep_gene_symbol -xr gene_symbol -rprefix GENES genes.gorz
Here we add the -l -r flags, discussed above, along with the -xl -xr to perform additonal equipment-join on the gene symbols from the two data sets. The variations of options to include with both varjoin and join commands are many, for more information check out the documentation. As mentioned above all actions covered in NOR context can be used in GOR context just as easily, such as aggregating or applying list functions
gor comb_data.gorz
| varjoin -l -r -rprefix VEP vep_annotations.gorz
| join -snpseg -xl vep_gene_symbol -xr gene_symbol -rprefix GENES genes.gorz
| group 1 -gc ref,alt -lis -sc VEP_consequence,GENES_gene_symbol
| replace lis_* replace(listfilter(#rc,'x!="PTCSC2"'),',',', ')
| replace lis_VEP_Consequence listdist(LISTSORTDESC(lis_VEP_Consequence))
Notice the replace with the #rc, here we are applying the replace command to all columns starting with the lis_ prefix and the listfilter() function will repeat for all columns and all elements within the column list. Finally we can also annotate a GOR file using the map command
gor comb_data.gorz
| join -snpseg -f 10 genes.gorz
| map -c gene_symbol <(nor vep_annotations.gorz | select gene_symbol,impact,consequence)
Notice the use of -f to make the join a fuzzy join with a fuzzfactor of ten bases. Here the VEP annotation GOR file is simply used in NOR context and the gene_symbol column the mapping column. Performing this annotation using varjoin is however much faster and more effective. As you can see querying genomically ordered data is a breeze with the GOR query language!
Test your knowledge!
Here are some fun excercises to test your knowledge and further your GORstanding!
- Select the first seven columns and rename them with a "GOR_" prefix.
- Calculate a column called phenotype to categorise individuals as a case or control. Cases should be those that are 30-39 years old and died of natural causes.
- Return the male/female ratio of the phenotype data set.
- Return the average pp4 value per tissue type and all other columns as comma separated lists.
- Try out the sort command to order the data in the tissue_id column in a descending order.
- Return the complete number of variants in the comb_data.gorz data set.
- Return the number of variants in each segment.
You should now be ready to take on more advanced GOR queries, deep diving into the GORdb data structure and the stand-alone Sequence Miner that comes complete with reference data on GOR format - stay tuned for the next blog post!
Happy GORing!
Appendix
Ensembl VEP annotations on GOR format, vep_annotations.gorz
CHROM POS REF ALT rsid Consequence IMPACT gene_symbol
chr10 60017871 C T rs10994139 intergenic_variant MODIFIER -
chr10 97692560 C T rs75040573 intergenic_variant MODIFIER -
chr11 7089006 C G rs11041169 5_prime_UTR_variant MODIFIER RBMXL2
chr11 7089006 C G rs11041169 regulatory_region_variant MODIFIER -
chr12 323500 G C rs3213751 downstream_gene_variant MODIFIER KDM5A
chr12 323500 G C rs3213751 intron_variant MODIFIER KDM5A
chr12 323500 G C rs3213751 upstream_gene_variant MODIFIER KDM5A
chr12 53778939 G A rs2365675 intron_variant MODIFIER -
chr12 53778939 G A rs2365675 non_coding_transcript_variant MODIFIER -
chr12 97393987 G T rs7304337 intergenic_variant MODIFIER -
chr13 43507920 A C rs10047763 intron_variant MODIFIER ENOX1
chr14 100404148 T TA rs57266307 intron_variant MODIFIER WDR25
chr14 100404148 T TA rs57266307 NMD_transcript_variant MODIFIER WDR25
chr14 100404148 T TA rs57266307 non_coding_transcript_variant MODIFIER WDR25
chr14 100404148 T TA rs57266307 regulatory_region_variant MODIFIER -
chr14 100404148 T TA rs57266307 upstream_gene_variant MODIFIER -
chr15 51302734 C T rs2445758 intron_variant MODIFIER CYP19A1
chr15 51302734 C T rs2445758 regulatory_region_variant MODIFIER -
chr15 101775997 T G rs1810232 downstream_gene_variant MODIFIER DNM1P47
chr15 101775997 T G rs1810232 intron_variant MODIFIER DNM1P47
chr15 101775997 T G rs1810232 non_coding_transcript_variant MODIFIER DNM1P47
chr15 101775997 T G rs1810232 intron_variant MODIFIER GOLGA8VP
chr15 101775997 T G rs1810232 non_coding_transcript_variant MODIFIER GOLGA8VP
chr15 101775997 T G rs1810232 upstream_gene_variant MODIFIER RN7SL209P
chr16 22081226 A G rs62047199 3_prime_UTR_variant MODIFIER MOSMO
chr16 22081226 A G rs62047199 downstream_gene_variant MODIFIER -
chr16 22081226 A G rs62047199 downstream_gene_variant MODIFIER MOSMO
chr16 22081226 A G rs62047199 intron_variant MODIFIER MOSMO
chr16 22081226 A G rs62047199 NMD_transcript_variant MODIFIER MOSMO
chr16 22081226 A G rs62047199 upstream_gene_variant MODIFIER MOSMO
chr16 36169633 C T rs1435502613 intergenic_variant MODIFIER -
chr21 36872872 G A rs73385297 intron_variant MODIFIER HLCS
chr21 36872872 G A rs73385297 NMD_transcript_variant MODIFIER HLCS
chr22 45952150 C T rs28678280 intron_variant MODIFIER WNT7B
chr3 21783666 G C rs12638755 intron_variant MODIFIER ZNF385D
chr3 21783666 G C rs12638755 NMD_transcript_variant MODIFIER ZNF385D
chr3 166217640 T C rs9851410 intergenic_variant MODIFIER -
chr5 2361524 A C rs114258117 intergenic_variant MODIFIER -
chr5 112877374 G T rs2545167 3_prime_UTR_variant MODIFIER REEP5
chr5 112877374 G T rs2545167 downstream_gene_variant MODIFIER REEP5
chr5 112877374 G T rs2545167 intron_variant MODIFIER SRP19
chr5 112877374 G T rs2545167 NMD_transcript_variant MODIFIER SRP19
chr5 112877374 G T rs2545167 non_coding_transcript_variant MODIFIER SRP19
chr5 133632835 C T rs6596132 intergenic_variant MODIFIER -
chr6 33912023 G A rs4711366 intron_variant MODIFIER -
chr6 33912023 G A rs4711366 non_coding_transcript_variant MODIFIER -
chr6 33912023 G A rs4711366 non_coding_transcript_exon_variant MODIFIER -
chr6 33912023 G A rs4711366 regulatory_region_variant MODIFIER -
chr6 163381871 T G rs9347747 intergenic_variant MODIFIER -
chr9 65074798 G A rs963551270 intergenic_variant MODIFIER -
chr9 97776641 A G rs10818050 intron_variant MODIFIER PTCSC2
chr9 97776641 A G rs10818050 non_coding_transcript_variant MODIFIER PTCSC2
chr9 97793827 A G rs965513 intron_variant MODIFIER PTCSC2
chr9 97793827 A G rs965513 non_coding_transcript_variant MODIFIER PTCSC2
chr9 137801650 G C rs112780689 downstream_gene_variant MODIFIER EHMT1
chr9 137801650 G C rs112780689 intron_variant MODIFIER EHMT1
chr9 137801650 G C rs112780689 NMD_transcript_variant MODIFIER EHMT1
chr9 137801650 G C rs112780689 non_coding_transcript_variant MODIFIER EHMT1
NCBI gene list on GOR format, genes.gorz
CHROM gene_start gene_stop Gene_Symbol Ensembl_Gene_ID
chr11 7088998 7091148 RBMXL2 ENSG00000170748
chr12 280057 389320 KDM5A ENSG00000073614
chr13 43213130 43786972 ENOX1 ENSG00000120658
chr14 100376485 100530303 WDR25 ENSG00000176473
chr15 51208057 51338596 CYP19A1 ENSG00000137869
chr15 101764629 101767023 DNM1P47
chr15 101775281 101779514 GOLGA8VP
chr16 22008111 22092652 MOSMO ENSG00000185716
chr21 36748625 36990211 HLCS ENSG00000159267
chr22 45920366 45977162 WNT7B ENSG00000188064
chr3 21412218 22372763 ZNF385D ENSG00000151789
chr5 112861287 112898371 SRP19 ENSG00000153037
chr5 112876385 112922227 REEP5 ENSG00000129625
chr6 33889511 33896907 LINC01016 ENSG00000249346
chr9 97699625 97853080 PTCSC2 ENSG00000236130
chr9 137619005 137836127 EHMT1 ENSG00000181090