1 Introduction

The DGEobj package implements an S3 class data object that represents an extension of the capabilities of the RangedSummarizedExperiment (RSE) originally developed by Martin Morgan et al. Both the DGEobj and the RSE object can capture the raw data for a differential gene expression analysis, namely a counts matrix along with associated gene and sample annotation. Additionally, the DGEobj extends this concept to support capture of downstream data objects during the capture the workflow of an analysis project. The availability of a structured data object like the DGEobj, that captures an entire workflow has multiple advantages. Sharing Differential Gene Expression (DGE) results with other colleagues is simplified because the entire analysis is encapsulated within the DGEobj and the recipient of data in this format can examine details of the analysis based on the annotation built into the DGE object. Furthermore, the DGEobj structure enables programmatic inspection of analysis results and thus enables automation of higher level integrative analyses across multiple projects.

The RSE object that inspired the DGEobj can capture as many assays as desired. An “assay” is defined as any matrix with n genes (rows) and m samples (columns). A limitation of the RSE object however is that the RSE can capture only 1 instance of row data (typically gene annotation) and 1 instance of column data (sample annotation with one row for every column of data in the assay slot). This limits the RSE in terms of its ability to hold downstream data objects because many of those objects meet the definition of row data also (e.g. Fit object, topTable output). Other types of data (e.g. design matrices, sample QC) meet the definition of column data. Thus, the DGEobj was modeled after the RSE object, but extended to accommodate multiple row and column data types. The DGEobj is thus uniquely suited to capturing the entire workflow of a DGE analysis.

2 Structure

2.1 Base Types

The DGEobj supports four distinct data types, that we refer to as “base types”:

  • assay data: dataframes or matrices of data with n rows (genes or transcripts) and m columns (samples)
  • row: a dataframe with n rows typically containing information about each gene with as many columns as needed (gene ID, gene symbols, chromosome information, etc). Other types of rowData include design matrices and fit objects for example.
  • col: a dataframe with m rows, that is, one row for each sample column in the assay slot
  • metadata: anything that doesn’t fit in one of the other slots

Fundamentally, the base type defines how an item can be subsetted.

2.2 DGEobj Nomenclature: Items, Types and Base Types

Multiple instances of a base type are accommodated by defining data types and items. Each data type is assigned a base type (e.g. geneData, GRanges, and Fit are all “types” of “baseType” = “row”).

Multiple instances of each type (with some exceptions) are supported. We therefore describe each instance of a “type” as an “item” and each item must have a user-defined item name and the item name must be unique within a DGEobj.

2.3 Unique Items

The intent of the DGEobj is that it captures the workflow and analysis results of a single dataset. As such, certain items that constitute the raw data are defined as unique and only one instance of these data types are allowed.

Three items: counts, design (except factors and sample information), and gene (or isoform, or exon) data are defined as unique. If appropriate chromosome location data (Name, Start, End, Strand) are supplied in the geneData, then a GRanges item is also created upon initializing a DGEobj.

2.4 Levels

Three levels are predefined for DGEobj objects: gene, isoform, or exon. A DGEobj may contain only one of these levels. For the sake of simplicity, throughout this document geneData is referred to which will exist in a gene-level DGEobj. Substitute isoform or exon for gene when working with that type of data.

2.5 Parentage

An analysis can become branched. For example, multiple models can be built from one dataset. Two features of the DGEobj serve to manage branched analyses. Unique types are the “counts”, “design” (sample data), and “geneData”. The DGEobj can contain multiple instances of other data types. To document the workflow and unambiguously define the relationships between data items in a branched analysis, the concept of a parent is invoked. Each data item carries a parent attribute that holds the name of the parent data item. In this way, for example, a topTable item can be linked to the contrast fit that produced it.

2.6 Original Data

The DGEobj can be subset, for instance to remove undetected genes or outlier samples. However, for some purposes it may be necessary to return to the original data. For this reason, a copy of the original data is also stored in a metadata slot. Metadata slots are carried along without subsetting so the original data may always be retrieved from the metadata slots. The item names of the original data in the metadata slot have a *"_orig"* suffix.

3 Working with DGEobjs

3.1 Creation/Initialization

Three dataframes are required to initialize a new DGEobj: counts, gene, and design data. The initDGEobj function can then be used to create a DGEobj. The data “level” must also be specified (one of “gene”, “isoform”, or “exon”).

The counts data must use the geneIDs as rownames. The gene data (rowData) should be a dataframe using geneIDs as rownames. The design data (colData) should be a dataframe with rownames matching the column names of the counts data.

Adding custom attributes (customAttr argument) is, strictly speaking, optional but highly recommended. The re-usability of the DGEobj data is only as good as the annotations. Attributes are supplied as named list of name/value pairs and it is recommended that users define and use a consistent set of names for their annotations.

myDgeObj <- initDGEobj(counts  = MyCounts,
                       rowData = MyGeneAnnotation,
                       colData = MyDesign,
                       level = "gene",
                       customAttr = list(Genome    = "Mouse.B38",
                                         GeneModel = "Ensembl.R89"))

It can be tedious to add annotation attributes via the initDGEobj function, so, to encourage extensive annotation, a more convenient annotateDGEobj() method is supplied that can also take a file of key/value pairs and also has an optional keys argument to import a subset of keys.

myDgeObj <- annotateDGEobj(myDgeObj, 
                           regfile, 
                           keys = list("ID", "Title", "Description",
                                       "Organism", "Tissue","GeneModel"))

3.2 Adding Items

Throughout analysis other data objects can be captured in the object:

myDGEobj <- addItem(myDGEobj, 
                    item     = MyDGEList,   # data object to add
                    itemName = "MyDGEList", # user-defined name
                    itemType = "DGEList",
                    parent   = "counts", 
                    funArgs  = match.call())

The item is the actual data object to add with the itemName is the user-defined name for that object. The itemType is one of the predefined types. To see a list of predefined types use the function showTypes().

The parent argument is particularly important for a branched analysis. For instance if more than one fit or normalization technique is used, the parent argument maintains the thread and unambiguously defines the workflow. The value assigned to parent should be the item name of the parent data object.

Passing match.call() to funArgs captures the function arguments from the currently running function and serves to document both the function call and the arguments used to create the data object that is being captured. It is also possible to pass a user-defined text string to the funArgs argument.

There is also an overwrite argument to the addItem() function. By default addItem() will refuse to add an itemname that already exists. Add the overwrite = TRUE argument to update an object that already exists in the DGEobj.

3.3 Batch Addition

Multiple data objects can be added to a DGEobj in one function call using the addItems() function. A typical use case for addItems() is when several different contrasts are performed from a single fit and therefore there are multiple topTable dataframes to be added to the DGEobj.

MyDgeObj <- addItems(MyDgeObj, 
                     itemList  = MyItems,  # data objects to add
                     itemTypes = MyTypes,  # user-defined names
                     parents   = MyParents,
                     itemAttr  = MyAttributes)

itemAttr is an optional named list of attributes that will be added to every item on the itemList. Note that the DGEobj has attributes and each item within the DGEobj can have its own attributes. Here the attributes are being added to the individual items.

3.4 Length and Dimensions

The length length() of a DGEobj refers to the number of data items in the DGEobj.

The dimension dim() reported for a DGEobj is the dimensions of the assays contained in the DGEobj. That is, the row dimension is the number of genes contained in the object and the column dimension is the number of samples contained in the object.

3.5 Rownames and Colnames

The rownames and colnames of the DGEobj are inherited from the first “assay” matrix in a DGEobj, typically the counts matrix for RNA-Seq data.

Note, there is no facility provided to assign rownames or colnames to a DGEobj. This is because it cannot be guaranteed that an object stored in the DGEobj has row/col assignment support.

3.6 Subsetting

Coordinates contained in square brackets can be used to subset a DGEobj, the same way that a dataframe or matrix can be subsetted. The subsetting function uses the base type to define how each data items is handled during subsetting. Metadata items are carried along unchanged.

#subset to the first 100 genes
MyDGEobj <- MyDGEobj[c(1:100), ]

#subset to the first 10 samples
MyDGEobj <- MyDGEobj[ ,c(1:10)]

#subset genes and samples
MyDGEobj <- MyDGEobj[1:100, 1:10]

3.7 Inventory

The inventory() function prints a table of the data items in a DGEobj. The output includes the item name, type, baseType, parent, class, and date created. If the verbose = TRUE argument is added, a funArgs column is also included.

inventory(DGEobj, verbose = TRUE)

Note: To retrieve just the item names of data stored in the DGEobj, use the names() function.

3.8 Adding a Data Type

A set of data types has already been defined based on data types commonly encountered in a DGE workflow. However this likely will not cover everything that someone might want to capture. Therefore, new types can be defined on a DGEobj. Each type must have a baseType and can be defined as unique or not.

# See predefined data types
showTypes(MyDgeObj)

# Add a new datatype called "sampleQC"
MyDgeObj <- newType(MyDgeObj,
                    itemType = "sampleQC",
                    baseType = "col",
                    uniqueItem = FALSE)

3.9 Accessing Data

There are several ways to extract one or more components from a DGEobj:

The getItem() function can be used to retrieve any item in a DGEobj by referencing the item name

MyCounts <- getItem(MyDgeObj, "counts")

The getItems() can be used to retrieve multiple items DGEobj by referencing the item names

MyItems <- getItems(MyDgeObj, list("counts", "geneData"))

If all the items to be retrieved are of the same type they can be retrieved using the type name:

# get all contrast results that are stored as a topTable type
MyContrasts <- getType(MyDgeObj, "topTable")

Similarly, all items of the same baseType can be retrieved using the baseType name:

MyRowData <- getBaseType(MyDgeObj, "row")

It is also possible to retrieve all the items in a DGEobj as a simple list by recasting the DGEobj as a simple list.

MyList <- as.list(MyDgeObj)

3.10 Ancillary Functions

3.10.1 showTypes()

Shows all the predefined types in a DGEobj. This shows all types, whether or not the DGEobj contains data associated with a given type.

showTypes(MyDgeObj)

3.10.2 baseType()

Returns the baseType for a given Type

baseType(MyDgeObj, "DGEList")

3.10.3 getAttribute/s()

Retrieves attribute(s) from an individual item

getAttribute(MyDgeObj, "attributeName")

# for example retrieve the formula that is stored on the design matrix
getAttribute(MyDgeObj$designMatrix, "formula")

3.10.4 rmItem()

Delete a particular item from a DGEobj

MyDgeObj <- rmItem(MyDgeObj, "itemName")

3.10.5 setAttribute/s()

Add attribute(s) to an item in a DGEobj

MyItem <- setAttribute(MyItem, myData, myName)

MyItem <- setAttributes(MyItem, list(myName1=myData1, myName2=myData2))

3.10.6 showMeta()

This function is intended to return project-oriented attributes assigned to the DGEobj using function annotateDGEobj(). It returns the attributes as a two column dataframe of Attribute/Value pairs.

MyAnnotation <- showMeta(MyDgeObj)