The ability to reproduce biomedical research findings is essential for advancing science (https://www.nih.gov/research-training/rigor-reproducibility). Although the reproducibility of biological experiments and clinical trials is a key aspect of that goal, so too is the ability to reproduce the data analyses that extracted useful information from the data those experiments produced, or that informed any conclusions or decisions based on them.
In theory, data analyses should be exactly reproducible, but in practice that is much easier said than done. Any change in the software used in the analysis, any change in parameters or data (including auxiliary data), or any non-deterministic aspect of the analysis (such as the order in which parallel computations execute) could alter the results. If complete records of an analysis are not available, determining how to reproduce it exactly is a complex, perhaps infeasible process known as forensic bioinformatics.
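The effect of execution order can be seen even in a toy case. The sketch below is only an illustration and is not part of FjORD: it assumes three hypothetical workers whose partial results (made-up values) are added together in whatever order they finish. Because floating-point addition is not associative, different arrival orders can give different totals, whereas `math.fsum` from the Python standard library returns the same correctly rounded total for every order.

```python
import itertools
import math


def naive_sum(values):
    """Add values left to right with plain floating-point addition."""
    total = 0.0
    for value in values:
        total += value
    return total


# Hypothetical partial results from three parallel workers (made-up values).
partials = [0.1, 0.2, 0.3]

# Every order in which the three workers could finish.
orders = list(itertools.permutations(partials))

# Distinct totals produced by plain addition across all arrival orders.
naive_results = {naive_sum(order) for order in orders}

# Distinct totals produced by exact summation across all arrival orders.
stable_results = {math.fsum(order) for order in orders}

print("plain addition:", sorted(naive_results))
print("math.fsum:     ", sorted(stable_results))
```

An analysis that accumulates results this way may need either an order-independent reduction or a fixed, recorded execution order to be exactly repeatable.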
The analysts in this department perform thousands of analyses on behalf of hundreds of PIs from across MD Anderson every year. We frequently get requests, perhaps years after the original analysis, for more details about specific aspects of the analysis, or to repeat an analysis with updated or new data.
FjORD is an enterprise information system that enables an analyst team to easily manage a large assortment of reproducible data analyses. In particular, it aims to simplify the collection, storage, and retrieval of the details needed to reproduce a large number of data analyses.
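As a rough illustration of the kind of detail that has to be collected, the following sketch builds a small provenance record for one analysis. It is not FjORD's actual data model or API: the function names (`build_record`, `changed_inputs`), the record fields, and the example file and parameters are all hypothetical, and only the Python standard library is used.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(analysis_id, input_paths, parameters):
    """Collect software, parameter, and input-data details (hypothetical layout)."""
    return {
        "analysis_id": analysis_id,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "parameters": parameters,
        "inputs": {str(p): sha256_of_file(p) for p in input_paths},
    }


def changed_inputs(record):
    """List recorded input files whose current contents no longer match."""
    return [
        path
        for path, recorded_hash in record["inputs"].items()
        if sha256_of_file(path) != recorded_hash
    ]


# Small demonstration with a temporary, made-up input file.
with tempfile.TemporaryDirectory() as workdir:
    data_file = Path(workdir) / "measurements.csv"
    data_file.write_bytes(b"sample,value\nA,1.5\n")

    record = build_record("demo-analysis", [data_file], {"alpha": 0.05})
    print(json.dumps(record, indent=2))

    # Simulate updated input data arriving after the analysis was recorded.
    data_file.write_bytes(b"sample,value\nA,1.7\n")
    print("changed inputs:", changed_inputs(record))
```

A record of this sort covers the three sources of change named earlier (software, parameters, and data) and makes it possible to tell, later on, whether the inputs on disk are still the ones that were analyzed.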
FjORD does not constrain how any analysis is performed or the technology used to make it reproducible. For instance, one analysis could be based on an RStudio project while another could use a Python workbook.
FjORD requires that each analysis and each data set be saved in a version control system (currently git). Thus, if an analysis is ever modified (for instance, because updated input data becomes available), users can view the original analysis result as well as any updates that have been saved.
FjORD assumes that each data set or analysis is mostly self-contained, but may be related to other data sets and analyses in the system.
FjORD is currently used by statistical analysts in the Department of Biostatistics and the Department of Bioinformatics and Computational Biology. We are planning a public release in the near future.