Department of Bioinformatics and Computational Biology

Home > Public Software > CliPP

CliPP

hidden rowfor table layout
Overview
DescriptionClonal structure identification through penalizing pairwise differences
Development Information
GitHub wwylab/CliPP
Languagepython (>3.5.1)
Current version1.3.3
LicenseArtistic (package)
StatusActive
Last updated2025/02/09
References
Citation Yujie Jiang, Matthew D Montierth, Kaixian Yu, Shuangxi Ji, Quang Tran, Xiaoqian Liu, Jessica C Lal, Shuai Guo, Aaron Wu, Seung Jun Shin, Shaolong Cao, Ruonan Li, Yuxin Tang, Tom Lesluyes, Scott Kopetz, Pavlos Msaouel, Anil K. Sood, Christopher Jones, Jaffer Ajani, Sumit K Subudhi, Ana Aparicio, Padmanee Sharma, John Paul Shen, Marek Kimmel, Jennifer R. Wang, Maxime Tarabichi, Rebecca Fitzgerald, Peter Van Loo, Hongtu Zhu, Wenyi Wang. Subclonal mutational load predicts survival and response to immunotherapy in cancers with low to moderate TMB. https://www.biorxiv.org/content/10.1101/2024.07.03.601939v1
Help and Support
Contact Wenyi Wang 
Discussion Issues On GitHub 

CliPP

CliPP is a subclonal identification tool designed for next-generation sequecing of bulk tumor samples. It is one of the 11 participating methods in the Pan-Cancer Analysis of Whole Genome (PCAWG) working group, Heterogeneity and Evolution, of International Cancer Genome Consortium (ICGC). And the method is described in the manuscript (See Citation).

Introduction

Subpopulations of tumor cells characterized by mutation profiles may confer differential fitness and consequently influence prognosis of cancers. Understanding subclonal architecture has the potential to provide biological insight in tumor evolution and advance precision cancer treatment. Recent methods comprehensively integrate single nucleotide variants (SNVs) and copy number aberrations (CNAs) to reconstruct subclonal architecture using whole-genome or whole-exome sequencing (WGS, WES) data from bulk tumor samples. However, the commonly used Bayesian methods require a large amount of computational resources, a prior knowledge of the number of subclones, and extensive post-processing. Regularized likelihood modeling approach, never explored for subclonal reconstruction, can inherently address these drawbacks. We therefore propose a model-based method, Clonal structure identification through Pair-wise Penalization, or CliPP, for clustering subclonal mutations without prior knowledge or post-processing. The CliPP model is applicable to genomic regions with or without CNAs. CliPP demonstrates high accuracy in subclonal reconstruction through extensive simulation studies. A penalized likelihood framework for subclonal reconstruction will help address intrinsic drawbacks of existing methods and expand the scope of computational analysis for cancer evolution in large cancer genomic studies. Also see our paper: https://www.biorxiv.org/content/10.1101/2021.03.31.437383v1.

Prerequisites

Setting up CliPP

Manual Install

git clone https://github.com/wwylab/CliPP.git
cd CliPP

python setup.py build

Docker container

We also include a Dockerfile in the repository. The user can build and run a Docker contrainer of CliPP to avoid any issues caused by package dependencies. The following provides a tutorial about this.

  docker build -t clipp .
  git clone https://github.com/wwylab/CliPP.git
  cd CliPP
  docker run -v $(pwd):/Sample clipp python3 /CliPP/run_clipp_main.py -i /Sample/test /Sample/sample/sample.snv.txt /Sample/sample/sample.cna.txt /Sample/sample/sample.purity.txt

Note: -v $(pwd):/Sample is to mount the the current directory in the host (the machine itself) to the container’s directory of /Sample, so the input data in the current directory sample/sample.cna.txt, sample/sample.cna.txt and sample/sample.purity.txt can be seen by the container through /Sample/sample/sample.snv.txt, /Sample/sample/sample.cna.txt, /Sample/sample/sample.purity.txt. The output folder is located at /Sample/test in the container, which can be accessed at $(pwd)/test in the host machine.

The installation normally takes a few seconds to 1 minute.

Structure of CliPP implementation

The flow chart below shows the CliPP implementation. Raw functions and scripts are under scr/. image

Approximately, CliPP can run on samples with up to 50,000 SNVs on a machine with 256GB memory. It may require more memory when there are more SNVs. When that happens, we apply a downsampling strategy, as implemented in all other subclonal reconstruction methods.

Furthermore, CliPP is automatically run in parallel when users have multiple cores available.

Input data sample

There are three required input files:

  1. sample.snv.txt: A tab separated file which stores a data matrix with the following named columns:
  2. chromosome_index: The chromosomal location of the SNV.
  3. position: the single-nucleotide position of the SNV on the corresponding chromosome.
  4. ref_count: The number of reads covering the locus and containing the reference allele.
  5. alt_count: The number of reads covering the locus and containing the alternative allele.

  6. sample.cna.txt: A tab separated file which stores a data matrix with the following named columns:

  7. chromosome_index: The chromosome location of the CNA.

  8. start_position: the start position of the CNA segment on the corresponding chromosome.

  9. end_position: the end position of the CNA segment on the corresponding chromosome.

  10. major_cn: The copy number of the major allele in tumor cells. This should be greater than equal to the value in the minor_cn column and greater than 0.

  11. minor_cn: The copy number of the minor allele. This must be less than equal the value in the major_cn column.

  12. total_cn: The sum of major_cn and minor_cn.

  13. sample.purity.txt: A file storing a scalar purity value between 0 and 1.

A simulated sample input data is under sample/.

Running CliPP with one-step implementation

The caller function run_clipp_main.py wraps up the CliPP pipeline and enables users to implement subclonal reconstruction in one-step. To try CliPP with our sample input, you may run:

python run_clipp_main.py sample/sample.snv.txt sample/sample.cna.txt sample/sample.purity.txt
````

A full manual is as follows:

usage: run_clipp_main.py [-h] [-i SAMPLE_ID] [-e PREPROCESS] [-b] [-f FINAL] [-l LAM] [-s SUBSAMPLE_SIZE] [-n REP_NUM] [-w WINDOW_SIZE] [-o OVERLAP_SIZE] snv_input cn_input purity_input

positional arguments: snv_input Path/Filename of the snv input. cn_input Path/Filename of the copy number input. purity_input Path/Filename of the purity input.

optional arguments: -h, –help show this help message and exit -i SAMPLE_ID, –sample_id SAMPLE_ID Name of the sample being processed. Default is ‘sample_id’. -l LAM, –lam LAM The penalty parameter (lambda), which usually takes values from 0.01-0.25. If skipping this parameter, it will return a list of results that take value of [0.01, 0.03, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25] by default, and select a preferable one among them. -b, –subsampling Whether doing subsampling or not. Default is not doing the subsampling, and a flag -b is needed for subsampling. -p PREPROCESS, –preprocess PREPROCESS Directory that stores the preprocess results. Default name is ‘preprocess_result/‘. -f FINAL, –final FINAL Directory that stores the final results after postprocessing. Default name is ‘final_result/‘.


The followings parameters are only needed when doing subsampling. We take partitions from 0 to 1 (determined by the `WINDOW_SIZE` parameter, default at `0.05`), then take the VAF of each SNV as an initial estimation of its cellular prevalence (CP) value and randomly assign SNVs that belong to their corresponding windows. The number of sampled SNV is proportional to number of total SNVs belonging to each window. 

-s SUBSAMPLE_SIZE, –subsample_size SUBSAMPLE_SIZE (Required if doing subsampling) The number of SNVs you want to include in each subsamples. We use 40,000 for 256GB memory. -n REP_NUM, –rep_num REP_NUM (Required if doing subsampling) The number of random subsamples needed. -w WINDOW_SIZE, –window_size WINDOW_SIZE Controls the length of the window. Takes value between 0 and 1. Default is 0.05. -o OVERLAP_SIZE, –overlap_size OVERLAP_SIZE Controls the overlapped length of two consecutive windows. Takes value between 0 and 1. Default is 0. ```

The CliPP Outputs

By default, all outputs will be stored in the folder named sample_id, and this name can be changed with the -i or --sample_id option.

In final results, cluster index = 0 indicates clonal mutations, while non-zero cluster indexes indicate subclonal mutations.

The final result for CliPP is two-fold: * The subclonal structure, i.e., clustering results: cluster number, the total number of SNVs in each cluster, and the estimated CP for each cluster. * The mutation assignment, i.e., cluster id for each mutation. This output can then serve as the basis for inference of phylogenetic trees.

Occasionally, a warning file may appear in the output, which can be caused by two factors: * For SNVs where the read count information that does not match the CNA-based input data, leading to out-of-bound calculations in the CliPP model, we replace the out-of-bound values with 0.01 and these SNVs are flagged as 1. This is an extremely rare case in real data when the CNA data are good, e.g. TCGA and PCAWG data. When many 1’s are observed, the users should look into the quality of the CNA input data. * While rare, some lambda values may not produce an output. In these instances, CliPP will select the best available result and list the failing lambda values in the warning file.

Demo output

By default, CliPP processes 11 results using 11 pre-selected lambda values (0.01, 0.03, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25), then recommend the result that follows our pre-defined rule (see Methods from the paper). Using the simulation data from directory sample/ with the command above, CliP will output two files to the default directory sample_id/final_result/Best_lambda, with filename subclonal_structure_lam0.25.txt and mutation_assignments_lam0.25, respectively. 0.25 is the selected lambda value in this case. The first file outputs an overview of subclonal reconstruction result for this simulation sample, which is shown below. An index is assigned to each cluster, ranging from 0 to n, with clusters sorted in descending order based on their Cellular Prevalence (CP), such that the cluster with the highest CP is assigned an index of 0, the second highest an index of 1, and so on.
subclonal_structure

The second file outputs a detailed list of mutation assignment result, i.e., for each mutation, which cluster is it assigned to. The figure below shows the top 10 rows of this mutation assignment table.
mutation_assignments

The expected run time for this sample on a personal desktop computer or laptop is around 1 second or less.