Utils

This module contains utility functions for various C2S-related workflows, such as rank transformation, train/test splitting, and tokenization during finetuning.

utils.generate_vocabulary(adata)

Create a vocabulary dictionary in which each key is a single gene token and each value is the number of cells where that gene has a non-zero count in the provided count matrix.

Parameters: adata – an AnnData object to generate cell sentences from. Expects that obs correspond to cells and vars correspond to genes
Returns: a dictionary of gene vocabulary
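
The counting described above can be sketched without AnnData. This is a minimal illustration, not the library's implementation: the helper name count_nonzero_cells, the list-of-lists matrix, and the gene names are all hypothetical, and rows are assumed to be cells and columns genes.

```python
from collections import OrderedDict


def count_nonzero_cells(counts, gene_names):
    """Map each gene to the number of cells (rows) with a non-zero count.

    Hypothetical helper for illustration; `counts` is a list of rows,
    one row per cell and one column per gene.
    """
    vocab = OrderedDict()
    for gene_idx, gene in enumerate(gene_names):
        vocab[gene] = sum(1 for cell in counts if cell[gene_idx] != 0)
    return vocab


# Toy matrix: 3 cells (rows) x 3 genes (columns).
counts = [
    [5, 0, 1],
    [2, 0, 0],
    [0, 3, 4],
]
vocab = count_nonzero_cells(counts, ["GENE_A", "GENE_B", "GENE_C"])
print(vocab)  # GENE_A -> 2, GENE_B -> 1, GENE_C -> 2
```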

utils.concat_vocabularies(vocabulary_list): Helper function to concatenate multiple vocabulary ordered dictionaries. Preserves order of features in the first vocabulary, and appends any additional features from successive dictionaries.
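
The ordering rule described for concatenation can be sketched as below. The helper name merge_vocabularies is hypothetical. The description does not say what happens to the counts of genes that appear in several vocabularies; this sketch sums them, which is an assumption.

```python
from collections import OrderedDict


def merge_vocabularies(vocabulary_list):
    """Merge ordered vocabularies, keeping first-seen feature order.

    Hypothetical helper for illustration only.
    """
    merged = OrderedDict()
    for vocab in vocabulary_list:
        for gene, count in vocab.items():
            # Assumption: counts of genes shared across vocabularies are summed.
            # Updating an existing key does not change its position.
            merged[gene] = merged.get(gene, 0) + count
    return merged


first = OrderedDict([("GENE_A", 2), ("GENE_B", 1)])
second = OrderedDict([("GENE_C", 4), ("GENE_A", 3)])
merged = merge_vocabularies([first, second])
print(list(merged.items()))
```

Here GENE_A and GENE_B keep their positions from the first vocabulary, and GENE_C is appended after them.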

utils.generate_sentences(adata, vocab, delimiter=' ', random_state=42)

Transform expression matrix to sentences. Sentences contain gene “words” denoting genes with non-zero expression. Genes are ordered from highest expression to lowest expression.

Parameters:

adata – an AnnData object to generate cell sentences from. Expects that obs correspond to cells and vars correspond to genes
vocab – an OrderedDict which has feature names as keys and counts as values
delimiter – string placed between gene names in each sentence
random_state – sets the numpy random state for splitting ties

Returns:

a numpy.ndarray of sentences, with gene names in each sentence separated by delimiter.
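
For a single cell, the rank ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name cell_to_sentence is hypothetical, it works on a plain list instead of an AnnData object, and it uses Python's random module rather than a numpy random state to break ties.

```python
import random


def cell_to_sentence(expression, gene_names, delimiter=" ", random_state=42):
    """Order expressed genes from highest to lowest expression.

    Hypothetical single-cell helper; genes with zero expression are dropped.
    """
    rng = random.Random(random_state)
    expressed = [
        (value, gene)
        for value, gene in zip(expression, gene_names)
        if value > 0
    ]
    # Shuffle first so that genes with equal expression end up in random order;
    # the stable sort below preserves the shuffled order among ties.
    rng.shuffle(expressed)
    expressed.sort(key=lambda pair: pair[0], reverse=True)
    return delimiter.join(gene for _, gene in expressed)


genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
sentence = cell_to_sentence([0, 7, 2, 9], genes)
print(sentence)  # GENE_D GENE_B GENE_C

# With ties, the order of tied genes depends on random_state.
tied = cell_to_sentence([3, 3, 0, 8], genes)
```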

utils.get_benchmark_df(normalized_expression: ndarray, rank_normalized_expression: ndarray, exclude_zeros: bool = True): Build pandas DataFrame with normalized expression and ranks

utils.sort_transcript_counts(raw_data): Sort transcript counts, yielding matrix of ranks

utils.benchmark_expression_conversion(benchmark_output_dir: str, save_name: str, normalized_expression_matrix, sample_size: int = 1024)

Helper function to take a normalized counts matrix and compute rank transformation and inverse transformation metrics. Saves plots and metrics to a subfolder called save_name + ‘_benchmark’.

Parameters:

benchmark_output_dir – directory to store results to (subdirectory will be created)
save_name – name of dataset being benchmarked
normalized_expression_matrix – numpy matrix of normalized counts
sample_size – number of cells to sample for computing metrics and plots

utils.build_arrow_dataset(cell_names: list, sentences: list, adata, label_col_names: list)

Build an arrow dataset from a list of cell IDs and cell sentences. Optionally include columns for additional cell metadata.

Parameters:

cell_names – list of strings representing (unique) cell identifiers
sentences – list of strings representing cell sentences
adata – anndata.AnnData object
label_col_names – list of column names in .obs DataFrame to save into dataset along with cell sentences

Returns:

Arrow dataset

utils.train_test_split_arrow_ds(arrow_ds)

Helper function to split an arrow dataset into train, val, and test sets with an 80/10/10 split ratio.

Parameters: arrow_ds – arrow dataset to split
Returns: Tuple of i) dataset dictionary with train, val, and test splits, and ii) dictionary of indices of cells in each split
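
The index bookkeeping behind an 80/10/10 split can be sketched with plain Python. The helper name split_indices and its seed parameter are hypothetical, and shuffling before splitting is an assumption; the description above does not say how cells are assigned to splits.

```python
import random


def split_indices(num_cells, seed=0):
    """Return train/val/test index lists in an 80/10/10 ratio.

    Hypothetical helper; shuffling with a fixed seed is an assumption.
    """
    indices = list(range(num_cells))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    train_end = num_cells * 8 // 10
    val_end = num_cells * 9 // 10
    return {
        "train": indices[:train_end],
        "val": indices[train_end:val_end],
        "test": indices[val_end:],
    }


splits = split_indices(100)
print({name: len(idx) for name, idx in splits.items()})
```

Each cell index lands in exactly one split, so the returned dictionary can be used to look up which cells were held out.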

utils.tokenize_loss_on_response(examples, tokenizer, ignore_token_id: int = -100): Tokenize LLM input + response, loss taken only on model response.

utils.tokenize_all(examples, tokenizer): Tokenize LLM input + response, loss taken on all tokens.
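
The difference between the two tokenization helpers above comes down to how the labels are built. The sketch below uses hand-written token IDs instead of a real tokenizer, and the function names are hypothetical. The default of -100 matches the default ignore_index of PyTorch's cross-entropy loss, so positions carrying that label do not contribute to the loss.

```python
def labels_on_response_only(prompt_ids, response_ids, ignore_token_id=-100):
    """Mask prompt positions so that loss is computed only on the response."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_token_id] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


def labels_on_all_tokens(prompt_ids, response_ids):
    """Use every token as a label so that loss is computed on all positions."""
    input_ids = list(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": list(input_ids)}


# Toy token IDs standing in for a tokenized prompt and response.
prompt_ids = [101, 102, 103]
response_ids = [201, 202]

masked = labels_on_response_only(prompt_ids, response_ids)
unmasked = labels_on_all_tokens(prompt_ids, response_ids)
print(masked["labels"])    # [-100, -100, -100, 201, 202]
print(unmasked["labels"])  # [101, 102, 103, 201, 202]
```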

utils.post_process_generated_cell_sentences(cell_sentence: str, vocab_list: list, replace_nonsense_string: str = 'NOT_A_GENE')

Helper function to replace non-gene words in generated sentences and deal with duplicated genes by averaging their positions in the sentence.

Parameters:

cell_sentence – string representing a cell sentence
vocab_list – list of all gene feature names, expression vector will be ordered following this list
replace_nonsense_string – word to replace non-gene words with (warning: this word will be removed from generated cell sentences, so do not choose a gene name)

Returns:

Tuple of i) post processed cell sentence gene list and ii) number of non-genes replaced
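
One way to read the post-processing described above is sketched below. The helper name dedupe_by_average_position is hypothetical, and ordering genes by the mean of their positions is only one interpretation of "averaging their positions"; the library may round or reinsert genes differently.

```python
def dedupe_by_average_position(cell_sentence, vocab_list,
                               replace_nonsense_string="NOT_A_GENE"):
    """Replace non-gene words, then place each gene at its mean position.

    Hypothetical helper; returns (gene list, number of non-genes replaced).
    """
    vocab = set(vocab_list)
    words = [
        word if word in vocab else replace_nonsense_string
        for word in cell_sentence.split()
    ]
    num_replaced = words.count(replace_nonsense_string)

    # Collect every position at which each valid gene occurs.
    positions = {}
    for idx, word in enumerate(words):
        if word != replace_nonsense_string:
            positions.setdefault(word, []).append(idx)

    # A duplicated gene is kept once, ordered by its average position.
    mean_position = {gene: sum(p) / len(p) for gene, p in positions.items()}
    genes = sorted(mean_position, key=mean_position.get)
    return genes, num_replaced


vocab_list = ["GENE_A", "GENE_B", "GENE_C"]
genes, num_replaced = dedupe_by_average_position(
    "GENE_A GENE_B foo GENE_C GENE_A", vocab_list
)
print(genes, num_replaced)  # ['GENE_B', 'GENE_A', 'GENE_C'] 1
```

In this example GENE_A occurs at positions 0 and 4, so its average position of 2 places it between GENE_B (position 1) and GENE_C (position 3).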

utils.reconstruct_expression_from_cell_sentence(cell_sentence_str: str, delimiter: str, vocab_list: list, slope: float, intercept: float)

Helper function to reconstruct an expression vector from a cell sentence.

Parameters:

cell_sentence_str – string representing a cell sentence
delimiter – character which separates gene names in the cell sentence
vocab_list – list of all gene feature names, expression vector will be ordered following this list
slope – slope of linear model fit on log rank vs log expression
intercept – intercept of linear model fit on log rank vs log expression

Returns:

Expression vector numpy array
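
The reconstruction can be sketched from the parameter descriptions above: a gene's position in the sentence gives its rank, and the linear model maps log rank back to an expression value. Several details here are assumptions rather than the library's behavior: the helper name expression_from_sentence, the use of base-10 logarithms with 1-indexed ranks, leaving the prediction on the model's log scale, and returning a plain list instead of a numpy array.

```python
import math


def expression_from_sentence(cell_sentence_str, delimiter, vocab_list,
                             slope, intercept):
    """Predict expression from rank with a linear model on log rank.

    Hypothetical helper. Genes absent from the sentence stay at 0.0, and
    words not found in vocab_list are skipped.
    """
    gene_to_index = {gene: idx for idx, gene in enumerate(vocab_list)}
    expression = [0.0] * len(vocab_list)
    for rank, gene in enumerate(cell_sentence_str.split(delimiter), start=1):
        if gene in gene_to_index:
            # Assumption: base-10 log of the 1-indexed rank.
            predicted = intercept + slope * math.log10(rank)
            expression[gene_to_index[gene]] = predicted
    return expression


vocab_list = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expression = expression_from_sentence(
    "GENE_C GENE_A", " ", vocab_list, slope=-1.0, intercept=2.0
)
print(expression)
```

The output follows the order of vocab_list, so GENE_C (rank 1) receives the intercept value at index 2, GENE_A (rank 2) receives a lower value at index 0, and the unexpressed genes remain zero. The sketch expects a sentence without duplicated genes, such as one that has already been post-processed.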