Preprocessing Immune Profiling Data
For convenience in introducing the dataset, the preprocessing code is forked from the huARdb paper (https://doi.org/10.1093/nar/gkab857).
We use cellranger-6.1.2 (https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome) to process the 10x single-cell GEX/TCR libraries and generate the raw data. Below are example scripts for preprocessing 10x scRNA/TCR-seq data.
The raw CellRanger output for the RNA and VDJ libraries should be placed in separate folders. For example, the RNA library output should be placed in a folder named “RNA” and the VDJ library output in a folder named “VDJ”. The directory structure should look like:
├── EXAMPLE_SAMPLE
│   ├── GEX/ProcessedData/outs
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv
│   │   │   ├── features.tsv
│   │   │   └── matrix.mtx
│   │   ├── ...
│   ├── VDJ/ProcessedData/outs
│   │   ├── filtered_contig_annotations.csv
│   │   ├── all_contig_annotations.csv
│   │   ├── ...
│   ├── ...
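Before running the pipeline, it can help to confirm that the files listed in the tree above are actually present. The following is a minimal sketch, not part of scAtlasVAE: the helper name `missing_inputs` and the two file lists are hypothetical and are copied from the tree above. It also assumes that the matrix files may be gzip-compressed (`.gz`), as recent CellRanger versions write them.

```python
from pathlib import Path

# File names copied from the directory tree above (relative to each "outs" folder).
EXPECTED_GEX = [
    "filtered_feature_bc_matrix/barcodes.tsv",
    "filtered_feature_bc_matrix/features.tsv",
    "filtered_feature_bc_matrix/matrix.mtx",
]
EXPECTED_VDJ = [
    "filtered_contig_annotations.csv",
    "all_contig_annotations.csv",
]


def _present(path, allow_gz=False):
    """True if the file exists, optionally also accepting a '.gz' variant."""
    if path.is_file():
        return True
    return allow_gz and path.with_name(path.name + ".gz").is_file()


def missing_inputs(gex_outs, vdj_outs):
    """Return the expected CellRanger files that are absent (hypothetical helper)."""
    missing = []
    for name in EXPECTED_GEX:
        path = Path(gex_outs) / name
        # Assumption: the matrix files may be stored gzip-compressed.
        if not _present(path, allow_gz=True):
            missing.append(str(path))
    for name in EXPECTED_VDJ:
        path = Path(vdj_outs) / name
        if not _present(path):
            missing.append(str(path))
    return missing
```

Calling `missing_inputs` with the `outs` directories of the GEX and VDJ runs returns an empty list when every expected file is found, and otherwise the paths that still need to be put in place.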
Once the data is in place, you can run our preprocessing pipeline on it:
Note
The preprocessing pipeline has only been tested on Linux and macOS; it has not been tested on Windows. Please report any issues to the scAtlasVAE repository (https://github.com/WanluLiuLab/scAtlasVAE/issues).
from scatlasvae.preprocessing._preprocess import *
# VDJPreprocessingV1Human is for human data
# VDJPreprocessingV1Mouse is for mouse data
pp = VDJPreprocessingV1Human(
    cellranger_gex_output_path = "./EXAMPLE_SAMPLE/RNA/",
    cellranger_vdj_output_path = "./EXAMPLE_SAMPLE/VDJ/",
    output_path = "./EXAMPLE_SAMPLE_OUTPUT_DIR/"
)
pp.process(
    r_path = "/opt/anaconda3/envs/r403/bin/Rscript",
    ref_data_path = HSAP_REF_DATA["Tonly"]
)
HSAP_REF_DATA['Tonly'] contains all cell type annotations for T cells:
Central memory CD8 T cells
Effector memory CD8 T cells
Follicular helper T cells
MAIT cells
Naive CD4 T cells
Naive CD8 T cells
T regulatory cells
Terminal effector CD4 T cells
Terminal effector CD8 T cells
Th1 cells
Th1/Th17 cells
Th17 cells
Th2 cells
HSAP_REF_DATA['TBonly'] contains additional cell type annotations for B cells:
Plasmablasts
Naive B cells
Non-switched memory B cells
Note
R must be installed and the Rscript executable should be on the PATH. Alternatively, you can specify the path to the Rscript executable with the “r_path” argument. Make sure your R version is >= 4.0.3 and that the following R packages are installed:
Seurat
DoubletFinder
dplyr
SingleR
zellkonverter
SeuratDisk
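The R requirements above can be checked from Python before starting a long run. This is a rough sketch rather than anything provided by scAtlasVAE: `r_check_expression` and `missing_r_packages` are hypothetical helper names, and the check relies only on `Rscript -e` and base R's `installed.packages()`. It does not verify the R version.

```python
import shutil
import subprocess

# Package names taken from the list above.
REQUIRED_R_PACKAGES = (
    "Seurat", "DoubletFinder", "dplyr", "SingleR", "zellkonverter", "SeuratDisk",
)


def r_check_expression(packages=REQUIRED_R_PACKAGES):
    """Build an R one-liner that prints the packages that are not installed."""
    vector = ", ".join(f'"{p}"' for p in packages)
    return (
        f"pkgs <- c({vector}); "
        'cat(setdiff(pkgs, rownames(installed.packages())), sep="\\n")'
    )


def missing_r_packages(r_path=None):
    """Run Rscript and return the required packages it reports as missing."""
    # Fall back to whatever Rscript is found on the PATH when r_path is not given.
    rscript = r_path or shutil.which("Rscript")
    if rscript is None:
        raise FileNotFoundError("Rscript not found on PATH; pass r_path explicitly")
    result = subprocess.run(
        [rscript, "-e", r_check_expression()],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()
```

Passing the same value that will later be given as “r_path” makes the check use the same R installation as the pipeline itself.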
The output directory should contain the following files:
├── EXAMPLE_SAMPLE_OUTPUT_DIR
│   ├── all_contig_annotations.json
│   ├── filterDoublet_prediction_sce.h5ad
│   ├── results_preprocessed.h5ad
│   ├── results_raw.h5ad
│   ├── results_single_chain_preprocessed.h5ad
│   ├── results_single_chain_raw.h5ad
│   ├── rscript.R
results_raw.h5ad is the raw data file.
results_preprocessed.h5ad is the preprocessed data file with extra chain TCRs (or BCRs).
results_single_chain_raw.h5ad is the raw data file for single chain TCRs (or BCRs).
results_single_chain_preprocessed.h5ad is the preprocessed data file for single chain TCRs (or BCRs).
filterDoublet_prediction_sce.h5ad is the doublet prediction file.
all_contig_annotations.json is the JSON file for the VDJ annotation.
rscript.R is the R script for the preprocessing.