Utils
This module contains utility functions for various C2S-related workflows, such as rank transformation, train/test splitting, and tokenization during finetuning.
- utils.generate_vocabulary(adata)
Create a vocabulary dictionary, where each key is a single gene token and each value is the number of cells in which that gene has a non-zero count in the provided count matrix.
- Parameters:
adata – an AnnData object to generate cell sentences from. Expects that obs correspond to cells and vars correspond to genes
- Returns:
a dictionary of gene vocabulary
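The counting rule can be illustrated without AnnData. The following is a minimal sketch, assuming a small dense cells-by-genes matrix held as nested lists. The helper count_nonzero_cells, the gene names, and the toy matrix are hypothetical and are not part of the utils module.

```python
from collections import OrderedDict

def count_nonzero_cells(matrix, gene_names):
    """Map each gene to the number of cells (rows) where its count is non-zero."""
    vocab = OrderedDict()
    for col, gene in enumerate(gene_names):
        vocab[gene] = sum(1 for row in matrix if row[col] != 0)
    return vocab

# Toy data (hypothetical): 3 cells (rows) x 3 genes (columns).
genes = ["GENE_A", "GENE_B", "GENE_C"]
counts = [
    [5, 0, 1],
    [0, 0, 2],
    [3, 1, 0],
]
vocab = count_nonzero_cells(counts, genes)
print(dict(vocab))  # {'GENE_A': 2, 'GENE_B': 1, 'GENE_C': 2}
```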
- utils.concat_vocabularies(vocabulary_list)
Helper function to concatenate multiple vocabulary ordered dictionaries. Preserves order of features in the first vocabulary, and appends any additional features from successive dictionaries.
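The ordering behaviour described above can be sketched as follows. How counts for genes shared between vocabularies are combined is not stated here; this sketch adds them together, which is an assumption. The helper name merge_vocabularies and the gene names are hypothetical.

```python
from collections import OrderedDict

def merge_vocabularies(vocabulary_list):
    """Merge ordered vocabularies, keeping first-seen feature order."""
    merged = OrderedDict()
    for vocab in vocabulary_list:
        for gene, count in vocab.items():
            # Assumption: counts of shared genes are added together.
            # Updating an existing key does not change its position.
            merged[gene] = merged.get(gene, 0) + count
    return merged

first = OrderedDict([("GENE_A", 2), ("GENE_B", 1)])
second = OrderedDict([("GENE_C", 4), ("GENE_A", 3)])
merged = merge_vocabularies([first, second])
print(list(merged.items()))  # [('GENE_A', 5), ('GENE_B', 1), ('GENE_C', 4)]
```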
- utils.generate_sentences(adata, vocab, delimiter=' ', random_state=42)
Transform expression matrix to sentences. Sentences contain gene “words” denoting genes with non-zero expression. Genes are ordered from highest expression to lowest expression.
- Parameters:
adata – an AnnData object to generate cell sentences from. Expects that obs correspond to cells and vars correspond to genes
vocab – an OrderedDict which has feature names as keys and counts as values
delimiter – string placed between gene names in each sentence
random_state – sets the numpy random state used for breaking ties
- Returns:
a numpy.ndarray of sentences, one per cell, with gene names separated by delimiter.
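The rank ordering for a single cell can be sketched as below. This is an illustration only: it uses the standard-library random module rather than the numpy random state, and the helper expression_to_sentence and the gene names are hypothetical.

```python
import random

def expression_to_sentence(expression, gene_names, delimiter=" ", random_state=42):
    """Order expressed genes from highest to lowest expression."""
    rng = random.Random(random_state)
    expressed = [(value, gene) for value, gene in zip(expression, gene_names) if value > 0]
    # Shuffle first so that the stable sort below breaks ties at random.
    rng.shuffle(expressed)
    expressed.sort(key=lambda pair: pair[0], reverse=True)
    return delimiter.join(gene for _, gene in expressed)

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expression = [2.0, 0.0, 7.5, 2.0]  # GENE_A and GENE_D are tied
sentence = expression_to_sentence(expression, genes)
print(sentence)
```

GENE_C always comes first and GENE_B never appears because its expression is zero. The relative order of the tied genes GENE_A and GENE_D depends on the seed, and is reproducible for a fixed random_state.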
- utils.get_benchmark_df(normalized_expression: ndarray, rank_normalized_expression: ndarray, exclude_zeros: bool = True)
Build a pandas DataFrame with normalized expression values and their ranks.
- utils.sort_transcript_counts(raw_data)
Sort transcript counts, yielding a matrix of ranks.
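A rank matrix of this kind can be sketched as follows. The conventions used here are assumptions: ranks are computed within each row (cell), rank 0 is the largest value, and ties are resolved by column order. The helper rank_rows is hypothetical.

```python
def rank_rows(matrix):
    """Return, for every row, the rank of each entry (0 = largest value)."""
    ranked = []
    for row in matrix:
        # Column indices ordered from largest to smallest value.
        order = sorted(range(len(row)), key=lambda col: row[col], reverse=True)
        ranks = [0] * len(row)
        for rank, col in enumerate(order):
            ranks[col] = rank
        ranked.append(ranks)
    return ranked

# Toy data (hypothetical): 2 cells x 3 genes.
counts = [
    [5, 0, 1],
    [0, 9, 2],
]
ranks = rank_rows(counts)
print(ranks)  # [[0, 2, 1], [2, 0, 1]]
```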
- utils.benchmark_expression_conversion(benchmark_output_dir: str, save_name: str, normalized_expression_matrix, sample_size: int = 1024)
Helper function to take a normalized counts matrix and compute rank transformation and inverse transformation metrics. Saves plots and metrics to a subfolder called save_name + '_benchmark'.
- Parameters:
benchmark_output_dir – directory in which to store results (a subdirectory will be created)
save_name – name of dataset being benchmarked
normalized_expression_matrix – numpy matrix of normalized counts
sample_size – number of cells to sample for computing metrics and plots
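Which metrics the function computes is not listed here. One metric of this kind is how well a straight line of log expression against log rank explains the data; the sketch below fits such a line by ordinary least squares and reports R^2. The helper fit_line, the use of base-10 logarithms, and the toy values are assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Toy cell (hypothetical): log expression values sorted from highest to lowest.
log_expression = [3.0, 2.4, 2.1, 1.8]
log_rank = [math.log10(rank) for rank in range(1, len(log_expression) + 1)]
slope, intercept, r_squared = fit_line(log_rank, log_expression)
print(slope, intercept, r_squared)
```

The slope is negative because expression falls as rank increases.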
- utils.build_arrow_dataset(cell_names: list, sentences: list, adata, label_col_names: list)
Build an arrow dataset from a list of cell IDs and cell sentences. Optionally include columns for additional cell metadata.
- Parameters:
cell_names – list of strings representing (unique) cell identifiers
sentences – list of strings representing cell sentences
adata – anndata.AnnData object
label_col_names – list of column names in .obs DataFrame to save into dataset along with cell sentences
- Returns:
Arrow dataset
- utils.train_test_split_arrow_ds(arrow_ds)
Helper function to split an arrow dataset into train, val, and test sets with an 80/10/10 split ratio.
- Parameters:
arrow_ds – arrow dataset to split
- Returns:
Tuple of i) dataset dictionary with train, val, and test splits, and ii) dictionary of indices of cells in each split
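The index bookkeeping behind an 80/10/10 split can be sketched on plain integers. Whether the function shuffles and how it seeds the shuffle are not stated here; the shuffle, the seed argument, and the helper split_indices are assumptions.

```python
import random

def split_indices(num_cells, seed=0):
    """Shuffle cell indices and cut them into 80/10/10 train/val/test parts."""
    indices = list(range(num_cells))
    random.Random(seed).shuffle(indices)  # assumption: cells are shuffled first
    train_end = num_cells * 8 // 10
    val_end = num_cells * 9 // 10
    return {
        "train": indices[:train_end],
        "val": indices[train_end:val_end],
        "test": indices[val_end:],
    }

splits = split_indices(50)
print({name: len(idx) for name, idx in splits.items()})  # {'train': 40, 'val': 5, 'test': 5}
```

Keeping the per-split indices makes it possible to map each split back to the original cells.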
- utils.tokenize_loss_on_response(examples, tokenizer, ignore_token_id: int = -100)
Tokenize LLM input + response, loss taken only on model response.
- utils.tokenize_all(examples, tokenizer)
Tokenize LLM input + response, loss taken on all tokens.
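The difference between the two tokenization helpers is which positions carry a label. The sketch below uses a toy word-level tokenizer in place of a real one; toy_tokenize and build_labels are hypothetical names. The value -100 is the default ignore_index of PyTorch's cross-entropy loss, so positions labelled with it do not contribute to the loss.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for a tokenizer: one integer id per word."""
    return [sum(ord(ch) for ch in word) % 1000 for word in text.split()]

def build_labels(prompt, response, loss_on_response_only, ignore_token_id=-100):
    """Return (input_ids, labels) for a concatenated prompt + response."""
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    input_ids = prompt_ids + response_ids
    if loss_on_response_only:
        # Prompt positions are masked so they do not contribute to the loss.
        labels = [ignore_token_id] * len(prompt_ids) + response_ids
    else:
        # Every token, prompt included, is a training target.
        labels = list(input_ids)
    return input_ids, labels

prompt = "Generate a cell sentence:"
response = "GENE_C GENE_A"
input_ids, masked_labels = build_labels(prompt, response, loss_on_response_only=True)
_, full_labels = build_labels(prompt, response, loss_on_response_only=False)
print(masked_labels)
print(full_labels)
```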
- utils.post_process_generated_cell_sentences(cell_sentence: str, vocab_list: list, replace_nonsense_string: str = 'NOT_A_GENE')
Helper function to replace non-gene words in generated sentences and deal with duplicated genes by averaging their positions in the sentence.
- Parameters:
cell_sentence – string representing a cell sentence
vocab_list – list of all gene feature names; words not in this list are treated as non-genes
replace_nonsense_string – word to replace non-gene words with (warning: this word will be removed from generated cell sentences, so do not choose a gene name)
- Returns:
Tuple of i) post processed cell sentence gene list and ii) number of non-genes replaced
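The clean-up steps can be sketched as below. The exact rule for placing a duplicated gene is an assumption: here each gene is assigned the mean of its positions and genes are then ordered by that mean. The helper post_process_sketch, the whitespace splitting, and the gene names are hypothetical.

```python
def post_process_sketch(cell_sentence, vocab_list, replace_nonsense_string="NOT_A_GENE"):
    """Replace non-gene words, merge duplicate genes, then drop the placeholder."""
    vocab = set(vocab_list)
    words = [w if w in vocab else replace_nonsense_string for w in cell_sentence.split()]
    num_replaced = words.count(replace_nonsense_string)

    # Record every position at which each gene occurs.
    positions = {}
    for idx, word in enumerate(words):
        if word != replace_nonsense_string:
            positions.setdefault(word, []).append(idx)

    # Assumption: a duplicated gene is placed at the mean of its positions,
    # and genes are then ordered by that mean (ties keep first-seen order).
    mean_position = {gene: sum(idxs) / len(idxs) for gene, idxs in positions.items()}
    genes = sorted(mean_position, key=mean_position.get)
    return genes, num_replaced

vocab_list = ["GENE_A", "GENE_B", "GENE_C"]
genes, num_replaced = post_process_sketch("GENE_A GENE_B xyz GENE_C GENE_A", vocab_list)
print(genes, num_replaced)  # ['GENE_B', 'GENE_A', 'GENE_C'] 1
```

GENE_A occurs at positions 0 and 4, so its mean position of 2 places it between GENE_B (position 1) and GENE_C (position 3).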
- utils.reconstruct_expression_from_cell_sentence(cell_sentence_str: str, delimiter: str, vocab_list: list, slope: float, intercept: float)
Helper function to reconstruct an expression vector from a cell sentence.
- Parameters:
cell_sentence_str – string representing a cell sentence
delimiter – character which separates gene names in the cell sentence
vocab_list – list of all gene feature names, expression vector will be ordered following this list
slope – slope of linear model fit on log rank vs log expression
intercept – intercept of linear model fit on log rank vs log expression
- Returns:
Expression vector numpy array
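The reconstruction can be sketched with plain lists. Several details are assumptions: ranks start at 1, the logarithm is base 10, the returned values stay in the log-expression space in which the line was fit, and genes absent from the sentence are set to 0. The helper reconstruct_sketch and the slope and intercept values are hypothetical.

```python
import math

def reconstruct_sketch(cell_sentence_str, delimiter, vocab_list, slope, intercept):
    """Rebuild an expression vector ordered like vocab_list from a cell sentence."""
    index_of = {gene: i for i, gene in enumerate(vocab_list)}
    expression = [0.0] * len(vocab_list)
    for position, gene in enumerate(cell_sentence_str.split(delimiter)):
        if gene not in index_of:
            continue  # skip words that are not in the vocabulary
        rank = position + 1  # assumption: ranks start at 1
        # Assumption: the line predicts log expression from log10 of the rank.
        expression[index_of[gene]] = intercept + slope * math.log10(rank)
    return expression

vocab_list = ["GENE_A", "GENE_B", "GENE_C"]
vector = reconstruct_sketch("GENE_C GENE_A", " ", vocab_list, slope=-1.0, intercept=2.0)
print(vector)
```

GENE_C is first in the sentence, so it receives the intercept (log10 of 1 is 0); GENE_A at rank 2 receives a smaller value; GENE_B does not appear and stays at 0.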