Getting Started with gepabds

Introduction

The gepabds package provides tools for analyzing gene expression data stored in SummarizedExperiment objects, including subclasses such as the SingleCellExperiment used in this vignette.

This vignette shows how to compute per-cell-type expression statistics for selected genes.

Loading the Data:


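# Load the package and its example dataset
library(gepabds)
library(SingleCellExperiment)  # attaches colData(); assumed not already attached by gepabds
data(example_se)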

example_se
#> class: SingleCellExperiment 
#> dim: 200 30 
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(200): KBTBD4 ZNF423 ... MMAA SLC31A2
#> rowData names(2): symbol chr
#> colnames(30): D31-5_83 D31-4_92 ... D31-1_87 D31-3_64
#> colData names(4): label donor plate sizeFactor
#> reducedDimNames(0):
#> mainExpName: endogenous
#> altExpNames(1): ERCC

Exploring the data:

# View gene names
rownames(example_se)
#>   [1] "KBTBD4"         "ZNF423"         "CCDC74A"        "TBX19"         
#>   [5] "ORAI3"          "NOVA1"          "TPSAB1"         "ARHGEF35"      
#>   [9] "OGFR"           "TNPO1"          "DNM1P35"        "PLCB4"         
#>  [13] "USP25"          "POPDC2"         "SRGAP2"         "CDIPT"         
#>  [17] "LOC389332"      "RSPH4A"         "PRMT2"          "SCARNA9L"      
#>  [21] "ALG10"          "ZXDA"           "ABO"            "FAM46A"        
#>  [25] "MOSPD2"         "LRRC37A11P"     "WDR83"          "DHX34"         
#>  [29] "SMO"            "FRMD6-AS1"      "FGF10"          "NDUFV1"        
#>  [33] "AAK1"           "RESP18"         "SNAP25"         "ARL6IP1"       
#>  [37] "STX1B"          "KCNC3"          "LAIR1"          "RWDD2B"        
#>  [41] "RTBDN"          "SPRYD7"         "SMG6"           "CAPSL"         
#>  [45] "ACOT2"          "HPCAL1"         "TGIF2-C20orf24" "CAPZA1"        
#>  [49] "CCDC112"        "ABHD5"          "IFNGR1"         "EMC1"          
#>  [53] "FPGT-TNNI3K"    "SCLY"           "SMIM6"          "SH3BGRL3"      
#>  [57] "HAUS4"          "LINC00693"      "FAM171B"        "IL1R2"         
#>  [61] "TSPAN11"        "KLF13"          "MAGEB5"         "FAM189A1"      
#>  [65] "ZC3H10"         "ZNRF2"          "NLRX1"          "MAN2A2"        
#>  [69] "SCCPDH"         "ZBTB21"         "P2RY6"          "ZSCAN30"       
#>  [73] "PIM1"           "C1S"            "LOC100505679"   "PARP10"        
#>  [77] "CNGB3"          "GTF2H5"         "PRKCDBP"        "CAMKK1"        
#>  [81] "PANK1"          "IRF2BPL"        "PHF12"          "SLC35C2"       
#>  [85] "LOC100506421"   "GALNT9"         "UBXN10"         "TNFSF9"        
#>  [89] "STAG2"          "TAOK3"          "LOC100130348"   "ZNF687"        
#>  [93] "PPP1R12A"       "ZNF350"         "LIME1"          "LIN7B"         
#>  [97] "EIF3B"          "COX16"          "C5orf34"        "IFT46"         
#> [101] "STK32C"         "SLC6A17"        "ASZ1"           "MORN5"         
#> [105] "TIAL1"          "TNS4"           "CDC16"          "MAP6"          
#> [109] "DCTN5"          "EP300"          "CHST3"          "NPHP3"         
#> [113] "CDK11B"         "PGAP1"          "CLCA4"          "SYNDIG1"       
#> [117] "MRPL21"         "ATG3"           "FKBP1AP1"       "CREG1"         
#> [121] "DPT"            "DDX10"          "EEF1E1-MUTED"   "RNF111"        
#> [125] "CLOCK"          "MRPL4"          "MAP3K14-AS1"    "PIWIL2"        
#> [129] "UGT2B15"        "C2orf15"        "BCAS2"          "HSPA7"         
#> [133] "CCNA1"          "EFNA4"          "KCNA2"          "RPS10"         
#> [137] "KIFC1"          "A1BG"           "LOC100289473"   "RALGPS1"       
#> [141] "LOC100506195"   "ZNF490"         "LOC100505695"   "CLTCL1"        
#> [145] "PIGO"           "NOP16"          "ATP6AP2"        "C1orf227"      
#> [149] "SLC8A2"         "FRAS1"          "TMEM39B"        "HAAO"          
#> [153] "LRRC2-AS1"      "C6orf62"        "DCAF17"         "MRPL11"        
#> [157] "HNF4A"          "GBAP1"          "RNF148"         "MEIS3P1"       
#> [161] "LOC641746"      "ZCCHC3"         "DLK1"           "SERINC1"       
#> [165] "TEX21P"         "ACOX3"          "MSH5-SAPCD1"    "OCIAD1"        
#> [169] "CA13"           "FXR1"           "CCT2"           "CD276"         
#> [173] "ESCO1"          "EMID1"          "NACC1"          "SPSB1"         
#> [177] "PGBD3"          "DBNL"           "MED13L"         "ZC3H7B"        
#> [181] "LOC79015"       "FAM86C2P"       "IQCH"           "ZNF702P"       
#> [185] "FUT10"          "TMPRSS11D"      "HDLBP"          "RHPN1-AS1"     
#> [189] "PTPN6"          "HOXA10"         "CRYM"           "FAM228A"       
#> [193] "ZDHHC22"        "GPRASP2"        "GORASP1"        "NFX1"          
#> [197] "FGR"            "BTBD7"          "MMAA"           "SLC31A2"

# View sample metadata
colData(example_se)
#> DataFrame with 30 rows and 4 columns
#>                label       donor    plate sizeFactor
#>          <character> <character> <factor>  <numeric>
#> D31-5_83 mesenchymal         D31        5   1.484996
#> D31-4_92        beta         D31        4   0.760032
#> D31-6_69      acinar         D31        6   1.077424
#> D29-7_30          pp         D29        7   0.590132
#> D30-8_64       alpha         D30        8   1.676675
#> ...              ...         ...      ...        ...
#> D30-8_87 endothelial         D30        8   1.809580
#> D29-2_47          pp         D29        2   0.512505
#> D31-7_47        beta         D31        7   0.688036
#> D31-1_87       alpha         D31        1   1.196185
#> D31-3_64      acinar         D31        3   2.222911

Genes of interest

This step selects a small subset of genes from the dataset to demonstrate downstream analysis.

In real analyses, genes of interest are typically selected based on biological relevance or statistical criteria, such as:

- high mean expression across samples
- high variability between conditions
- known involvement in a pathway or disease process
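The criteria above can be sketched in base R. This is a minimal illustration on a simulated matrix, not part of gepabds: the matrix `expr`, the gene names, and the `pathway_genes` set are all hypothetical, and the cutoff of three genes is arbitrary.

```r
# Toy expression matrix: 6 hypothetical genes (rows) x 8 samples (columns).
set.seed(1)
expr <- matrix(rexp(48, rate = 1), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("s", 1:8)))

# Per-gene mean and variance across samples.
gene_mean <- rowMeans(expr)
gene_var  <- apply(expr, 1, var)

# Criterion 1: the 3 genes with the highest mean expression.
top_mean <- names(sort(gene_mean, decreasing = TRUE))[1:3]

# Criterion 2: the 3 most variable genes.
top_var <- names(sort(gene_var, decreasing = TRUE))[1:3]

# Criterion 3: a curated gene set (hypothetical), restricted to genes
# that are actually present in the data.
pathway_genes <- c("gene2", "gene5", "geneX")
in_data <- intersect(pathway_genes, rownames(expr))
```

With the real object, the same calls could be applied to a matrix extracted with `assay(example_se, "logcounts")`, assuming the SummarizedExperiment accessor `assay()` is available in the session.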

# genes of interest
genes_to_use <- rownames(example_se)[1:5]
genes_to_use
#> [1] "KBTBD4"  "ZNF423"  "CCDC74A" "TBX19"   "ORAI3"

Computing Expression Stats

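compute_expr_stats() summarizes expression for each gene within each cell type (the cell types appear to come from the label column of colData). For every gene and cell-type combination it reports:
- mean_expr: mean expression across the cells of that type
- median_expr: median expression across those cells
- detection_rate: the fraction of those cells in which the gene is detected (judging from the output below, cells with non-zero expression)
- n_cells: the number of cells of that type

The genes argument restricts the output to the selected genes.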

These metrics help prioritize genes that are biologically informative.


# compute expression statistics

result <- compute_expr_stats(example_se, genes = genes_to_use)

result
#>       gene   cell_type mean_expr median_expr detection_rate n_cells
#> 1   KBTBD4 mesenchymal 0.0000000   0.0000000      0.0000000       2
#> 2   KBTBD4        beta 0.0000000   0.0000000      0.0000000       3
#> 3   KBTBD4      acinar 0.0000000   0.0000000      0.0000000       6
#> 4   KBTBD4          pp 0.0000000   0.0000000      0.0000000       4
#> 5   KBTBD4       alpha 0.0000000   0.0000000      0.0000000      10
#> 6   KBTBD4 endothelial 0.0000000   0.0000000      0.0000000       2
#> 7   KBTBD4       delta 0.0000000   0.0000000      0.0000000       1
#> 8   KBTBD4        duct 0.0000000   0.0000000      0.0000000       2
#> 9   ZNF423 mesenchymal 0.3719604   0.3719604      0.5000000       2
#> 10  ZNF423        beta 0.0000000   0.0000000      0.0000000       3
#> 11  ZNF423      acinar 0.0000000   0.0000000      0.0000000       6
#> 12  ZNF423          pp 0.0000000   0.0000000      0.0000000       4
#> 13  ZNF423       alpha 0.0000000   0.0000000      0.0000000      10
#> 14  ZNF423 endothelial 1.1493168   1.1493168      1.0000000       2
#> 15  ZNF423       delta 0.0000000   0.0000000      0.0000000       1
#> 16  ZNF423        duct 0.0000000   0.0000000      0.0000000       2
#> 17 CCDC74A mesenchymal 0.6169727   0.6169727      0.5000000       2
#> 18 CCDC74A        beta 0.0000000   0.0000000      0.0000000       3
#> 19 CCDC74A      acinar 0.0000000   0.0000000      0.0000000       6
#> 20 CCDC74A          pp 0.3075665   0.0000000      0.2500000       4
#> 21 CCDC74A       alpha 0.1232688   0.0000000      0.1000000      10
#> 22 CCDC74A endothelial 0.0000000   0.0000000      0.0000000       2
#> 23 CCDC74A       delta 0.0000000   0.0000000      0.0000000       1
#> 24 CCDC74A        duct 0.0000000   0.0000000      0.0000000       2
#> 25   TBX19 mesenchymal 0.3133253   0.3133253      0.5000000       2
#> 26   TBX19        beta 0.0000000   0.0000000      0.0000000       3
#> 27   TBX19      acinar 0.0000000   0.0000000      0.0000000       6
#> 28   TBX19          pp 0.0000000   0.0000000      0.0000000       4
#> 29   TBX19       alpha 0.0000000   0.0000000      0.0000000      10
#> 30   TBX19 endothelial 0.0000000   0.0000000      0.0000000       2
#> 31   TBX19       delta 0.0000000   0.0000000      0.0000000       1
#> 32   TBX19        duct 0.5766949   0.5766949      0.5000000       2
#> 33   ORAI3 mesenchymal 0.0000000   0.0000000      0.0000000       2
#> 34   ORAI3        beta 1.0610933   1.2130741      0.6666667       3
#> 35   ORAI3      acinar 0.2475599   0.0000000      0.3333333       6
#> 36   ORAI3          pp 0.3075665   0.0000000      0.2500000       4
#> 37   ORAI3       alpha 0.5745060   0.0000000      0.4000000      10
#> 38   ORAI3 endothelial 0.3178523   0.3178523      0.5000000       2
#> 39   ORAI3       delta 0.0000000   0.0000000      0.0000000       1
#> 40   ORAI3        duct 0.0000000   0.0000000      0.0000000       2
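To make the columns of this table concrete, here is a rough base R sketch of how such per-gene, per-cell-type summaries can be computed. It is not the implementation of compute_expr_stats(): the helper `summarise_group()`, the toy matrix, and the labels are hypothetical, and the definition of detection rate (share of cells with non-zero expression) is an assumption inferred from the output above.

```r
# Toy data: 2 hypothetical genes x 5 cells, with one cell-type label per cell.
expr <- matrix(c(0, 1.2, 0, 0, 0.8,
                 2, 0,   0, 1, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), paste0("cell", 1:5)))
label <- c("alpha", "alpha", "beta", "beta", "beta")

# Summarise one gene within one cell type (hypothetical helper).
summarise_group <- function(gene, type) {
  x <- expr[gene, label == type]
  data.frame(gene = gene,
             cell_type = type,
             mean_expr = mean(x),
             median_expr = median(x),
             detection_rate = mean(x > 0),  # assumed definition: share of non-zero cells
             n_cells = length(x),
             stringsAsFactors = FALSE)
}

# All gene x cell-type combinations; cell type varies fastest within each gene.
grid <- expand.grid(type = unique(label), gene = rownames(expr),
                    stringsAsFactors = FALSE)
rows <- lapply(seq_len(nrow(grid)),
               function(i) summarise_group(grid$gene[i], grid$type[i]))
stats <- do.call(rbind, rows)
```

The resulting `stats` data frame has one row per gene and cell type, in the same layout as the table printed above.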

Interpreting the results:

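Each row corresponds to one gene in one cell type, so the five selected genes and eight cell types give 40 rows. A higher mean_expr means the gene is more abundant in that cell type, and detection_rate shows how many of the cells contribute to that signal. For example, KBTBD4 is not detected in any cell type, whereas ZNF423 is detected in both endothelial cells (detection rate 1.0) and ORAI3 is detected in five of the eight cell types.

Several groups contain very few cells (n_cells is 1 for delta and 2 for mesenchymal, endothelial, and duct), so their estimates should be read with caution. In downstream analyses, we often focus on genes that are well expressed and whose expression differs between cell types, as these are more likely to be biologically meaningful.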
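One way to act on such a table is to drop poorly supported groups and then rank what remains by mean expression. The sketch below uses a few ORAI3 rows copied from the result table above. The thresholds (at least 3 cells, detection rate of at least 0.3) are hypothetical and chosen only for illustration.

```r
# A few rows copied by hand from the result table (ORAI3 only).
res <- data.frame(
  gene = "ORAI3",
  cell_type = c("mesenchymal", "beta", "acinar", "pp", "alpha"),
  mean_expr = c(0.0000000, 1.0610933, 0.2475599, 0.3075665, 0.5745060),
  detection_rate = c(0.0000000, 0.6666667, 0.3333333, 0.2500000, 0.4000000),
  n_cells = c(2, 3, 6, 4, 10),
  stringsAsFactors = FALSE
)

# Hypothetical thresholds: keep groups with at least 3 cells in which
# the gene is detected in at least 30% of the cells.
keep <- res$n_cells >= 3 & res$detection_rate >= 0.3

# Rank the remaining gene x cell-type rows by mean expression.
ranked <- res[keep, ]
ranked <- ranked[order(ranked$mean_expr, decreasing = TRUE), ]
```

The same two steps apply unchanged to the full `result` data frame, since it carries the same column names.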