Preprocessing Immune Profiling Data

We forked the preprocessing code from the huARdb paper (https://doi.org/10.1093/nar/gkab857) to make it easier to introduce the dataset.

We use cellranger-6.1.2 (https://support.10xgenomics.com/single-cell-gene-expression/software/overview/welcome) to process the 10x single-cell GEX/TCR libraries and generate the raw data. Below are example scripts for running 10x scRNA/TCR-seq data.

The raw Cell Ranger output files for the RNA and VDJ libraries should be placed in separate folders. For example, the RNA library should be placed in a folder named “RNA” and the VDJ library in a folder named “VDJ”. The directory structure should look like:

├── EXAMPLE_SAMPLE
│   ├── GEX/ProcessedData/outs
│   │   ├── filtered_feature_bc_matrix
│   │   │   ├── barcodes.tsv
│   │   │   ├── features.tsv
│   │   │   └── matrix.mtx
│   │   ├── ...
│   ├── VDJ/ProcessedData/outs
│   │   ├── filtered_contig_annotations.csv
│   │   ├── all_contig_annotations.csv
│   │   ├── ...
│   ├── ...
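
Before running the pipeline, it can help to confirm that the expected input files are in place. The following is a minimal sketch, not part of scAtlasVAE: the helper name `missing_inputs` is hypothetical, the default sub-paths are copied from the tree above (adjust them if your folders are named differently), and accepting a `.gz` suffix for each file is an assumption about how your Cell Ranger output is stored.

```python
from pathlib import Path

# File names taken from the directory tree above.
GEX_FILES = ("barcodes.tsv", "features.tsv", "matrix.mtx")
VDJ_FILES = ("filtered_contig_annotations.csv", "all_contig_annotations.csv")


def missing_inputs(sample_dir,
                   gex_outs="GEX/ProcessedData/outs",
                   vdj_outs="VDJ/ProcessedData/outs"):
    """Return the expected input files that are absent under sample_dir.

    Hypothetical helper: the default sub-paths mirror the example tree and
    are assumptions, not values required by the pipeline.
    """
    sample = Path(sample_dir)
    expected = [sample / gex_outs / "filtered_feature_bc_matrix" / name
                for name in GEX_FILES]
    expected += [sample / vdj_outs / name for name in VDJ_FILES]

    missing = []
    for path in expected:
        # Assumption: a gzip-compressed copy (e.g. matrix.mtx.gz) also counts.
        gz_path = path.with_name(path.name + ".gz")
        if not path.exists() and not gz_path.exists():
            missing.append(str(path))
    return missing
```

Calling `missing_inputs("./EXAMPLE_SAMPLE")` returns an empty list when every file in the tree is present, and otherwise lists the paths that still need to be supplied.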

You can simply use our preprocessing pipeline to preprocess the data.

Note

The preprocessing pipeline has only been tested on Linux and macOS. It has not been tested on Windows. Please report any issues to the scAtlasVAE repository (https://github.com/WanluLiuLab/scAtlasVAE/issues).

from scatlasvae.preprocessing._preprocess import (
    VDJPreprocessingV1Human,
    HSAP_REF_DATA,
)

# VDJPreprocessingV1Human is for human data
# VDJPreprocessingV1Mouse is for mouse data
pp = VDJPreprocessingV1Human(
    cellranger_gex_output_path="./EXAMPLE_SAMPLE/RNA/",
    cellranger_vdj_output_path="./EXAMPLE_SAMPLE/VDJ/",
    output_path="./EXAMPLE_SAMPLE_OUTPUT_DIR/",
)
# r_path points to the Rscript executable used for the R preprocessing step
pp.process(
    r_path="/opt/anaconda3/envs/r403/bin/Rscript",
    ref_data_path=HSAP_REF_DATA["Tonly"],
)

HSAP_REF_DATA['Tonly'] contains all cell type annotations for T cells:

  • Central memory CD8 T cells

  • Effector memory CD8 T cells

  • Follicular helper T cells

  • MAIT cells

  • Naive CD4 T cells

  • Naive CD8 T cells

  • T regulatory cells

  • Terminal effector CD4 T cells

  • Terminal effector CD8 T cells

  • Th1 cells

  • Th1/Th17 cells

  • Th17 cells

  • Th2 cells

HSAP_REF_DATA['TBonly'] contains additional cell type annotations for B cells:

  • Plasmablasts

  • Naive B cells

  • Non-switched memory B cells

Note

You should have R installed, and the Rscript executable should be on your PATH. Alternatively, specify the path to the Rscript executable with the “r_path” argument. Make sure your R version is >= 4.0.3 and that the following R packages are installed:

  • Seurat

  • DoubletFinder

  • dplyr

  • SingleR

  • zellkonverter

  • SeuratDisk
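
The R requirement above can be checked from Python before calling the pipeline. This is a rough sketch rather than anything shipped with scAtlasVAE: `parse_r_version` and `check_rscript` are hypothetical helper names, and the regular expression assumes that `Rscript --version` prints a banner containing `version X.Y.Z`. It does not check whether the listed R packages are installed.

```python
import re
import shutil
import subprocess

MIN_R_VERSION = (4, 0, 3)  # minimum version stated in the note above


def parse_r_version(text):
    """Extract (major, minor, patch) from a version banner, or None."""
    match = re.search(r"version (\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def check_rscript(r_path=None):
    """Return (path, version) for a usable Rscript, else raise RuntimeError.

    Hypothetical helper: when r_path is None, Rscript is looked up on PATH.
    """
    path = r_path or shutil.which("Rscript")
    if path is None:
        raise RuntimeError("Rscript not found on PATH; pass r_path explicitly")
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Assumption: the banner may appear on stdout or stderr, so search both.
    version = parse_r_version(result.stdout + result.stderr)
    if version is None or version < MIN_R_VERSION:
        raise RuntimeError(f"R >= 4.0.3 required, found {version} at {path}")
    return path, version
```

The path returned by `check_rscript()` can then be passed as the `r_path` argument of `pp.process`.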

The output directory should contain the following files:

├── EXAMPLE_SAMPLE_OUTPUT_DIR
│   ├── all_contig_annotations.json
│   ├── filterDoublet_prediction_sce.h5ad
│   ├── results_preprocessed.h5ad
│   ├── results_raw.h5ad
│   ├── results_single_chain_preprocessed.h5ad
│   ├── results_single_chain_raw.h5ad
│   ├── rscript.R

  • The results_raw.h5ad is the raw data file.

  • The results_preprocessed.h5ad is the preprocessed data file with extra chain TCRs (or BCRs).

  • The results_single_chain_raw.h5ad is the raw data file for single chain TCRs (or BCRs).

  • The results_single_chain_preprocessed.h5ad is the preprocessed data file for single chain TCRs (or BCRs).

  • The filterDoublet_prediction_sce.h5ad is the doublet prediction file.

  • The all_contig_annotations.json is the json file for the VDJ annotation.

  • The rscript.R is the R script for the preprocessing.
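
Once the pipeline has finished, the `.h5ad` outputs can be inspected from Python. The sketch below is illustrative only: `load_output` is a hypothetical helper, the file names come from the list above, and it assumes the `anndata` package is installed (its `read_h5ad` function reads `.h5ad` files). The structure of the returned object is not described here because this page does not specify it.

```python
from pathlib import Path

# Output files listed above that are stored in .h5ad format.
H5AD_OUTPUTS = (
    "results_raw.h5ad",
    "results_preprocessed.h5ad",
    "results_single_chain_raw.h5ad",
    "results_single_chain_preprocessed.h5ad",
    "filterDoublet_prediction_sce.h5ad",
)


def load_output(output_dir, name="results_preprocessed.h5ad"):
    """Load one of the pipeline's .h5ad outputs as an AnnData object.

    Hypothetical helper; raises ValueError for names that are not .h5ad
    outputs and FileNotFoundError if the file has not been produced.
    """
    if name not in H5AD_OUTPUTS:
        raise ValueError(f"{name!r} is not one of: {', '.join(H5AD_OUTPUTS)}")
    path = Path(output_dir) / name
    if not path.is_file():
        raise FileNotFoundError(path)
    # Imported lazily; assumption: anndata is available in this environment.
    import anndata
    return anndata.read_h5ad(path)
```

For example, `adata = load_output("./EXAMPLE_SAMPLE_OUTPUT_DIR/")` loads the preprocessed data, and passing `name="results_raw.h5ad"` loads the raw data instead.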