ml

ml contains the code base to process glycans for machine learning, construct state-of-the-art machine learning models, train them, and analyze trained models and glycan representations. It currently contains the following modules:

model_training

contains functions for training machine learning models


EarlyStopping

 EarlyStopping (patience:int=7, verbose:bool=False)

Stops training early if the validation loss doesn’t improve within a given patience

Type Default Details
patience int 7 epochs to wait after last improvement
verbose bool False whether to print messages
Returns None
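
The patience logic described above can be sketched in plain Python. This is a minimal illustration, not the glycowork implementation: the class name `PatienceStopper`, its attributes, and the strict "lower is better" comparison are assumptions made for the example.

```python
class PatienceStopper:
    """Illustrative patience-based early stopping (hypothetical name, not glycowork code)."""

    def __init__(self, patience: int = 7, verbose: bool = False) -> None:
        self.patience = patience
        self.verbose = verbose
        self.best_loss = float("inf")
        self.counter = 0          # epochs since the last improvement
        self.early_stop = False

    def __call__(self, val_loss: float) -> bool:
        if val_loss < self.best_loss:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.verbose:
                print(f"No improvement for {self.counter}/{self.patience} epochs")
            if self.counter >= self.patience:
                self.early_stop = True
        return self.early_stop


# Toy validation losses, made up for the example.
stopper = PatienceStopper(patience=3)
losses = [0.9, 0.7, 0.71, 0.72, 0.70, 0.69]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper(loss):
        stopped_at = epoch
        break
```

With these toy losses, the loop stops before the final value is seen, because three consecutive epochs fail to beat the best loss.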

train_model

 train_model (model:torch.nn.modules.module.Module,
              dataloaders:Dict[str,torch.utils.data.dataloader.DataLoader]
              , criterion:torch.nn.modules.module.Module,
              optimizer:torch.optim.optimizer.Optimizer,
              scheduler:torch.optim.lr_scheduler._LRScheduler,
              num_epochs:int=25, patience:int=50,
              mode:str='classification', mode2:str='multi',
              return_metrics:bool=False)

trains a deep learning model on predicting glycan properties

Type Default Details
model Module graph neural network for analyzing glycans
dataloaders Dict dict with ‘train’ and ‘val’ loaders
criterion Module PyTorch loss function
optimizer Optimizer PyTorch optimizer, has to be SAM if mode != “regression”
scheduler _LRScheduler PyTorch learning rate decay
num_epochs int 25 number of epochs for training
patience int 50 epochs without improvement until early stop
mode str classification ‘classification’, ‘multilabel’, or ‘regression’
mode2 str multi ‘multi’ or ‘binary’ classification
return_metrics bool False whether to return metrics
Returns Union best model from training and the training and validation metrics

training_setup

 training_setup (model:torch.nn.modules.module.Module, lr:float,
                 lr_patience:int=4, factor:float=0.2,
                 weight_decay:float=0.0001, mode:str='multiclass',
                 num_classes:int=2, gsam_alpha:float=0.0)

prepares optimizer, learning rate scheduler, and loss criterion for model training

Type Default Details
model Module graph neural network for analyzing glycans
lr float learning rate
lr_patience int 4 epochs before reducing learning rate
factor float 0.2 factor to multiply lr on reduction
weight_decay float 0.0001 regularization parameter
mode str multiclass type of prediction task
num_classes int 2 number of classes for classification
gsam_alpha float 0.0 if >0, uses GSAM instead of SAM optimizer
Returns Tuple optimizer, scheduler, criterion
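
How `lr_patience` and `factor` interact can be sketched without PyTorch. The sketch assumes a plateau-style schedule (the learning rate is multiplied by `factor` once the validation loss has stagnated for more than `lr_patience` epochs); `PlateauLR` is a hypothetical helper written for this example and is not part of glycowork.

```python
class PlateauLR:
    """Illustrative reduce-on-plateau schedule (hypothetical helper, not glycowork code)."""

    def __init__(self, lr: float, lr_patience: int = 4, factor: float = 0.2) -> None:
        self.lr = lr
        self.lr_patience = lr_patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Assumed convention: reduce once stagnation exceeds lr_patience epochs.
        if self.bad_epochs > self.lr_patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Toy run: the loss never improves after the first epoch.
schedule = PlateauLR(lr=0.1, lr_patience=2, factor=0.5)
history = [schedule.step(loss) for loss in [1.0, 1.0, 1.0, 1.0]]
```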

train_ml_model

 train_ml_model (X_train:Union[pandas.core.frame.DataFrame,List],
                 X_test:Union[pandas.core.frame.DataFrame,List],
                 y_train:List, y_test:List, mode:str='classification',
                 feature_calc:bool=False, return_features:bool=False,
                 feature_set:List[str]=['known', 'exhaustive'],
                 additional_features_train:Optional[pandas.core.frame.DataFrame]=None,
                 additional_features_test:Optional[pandas.core.frame.DataFrame]=None)

wrapper function to train standard machine learning models on glycans

Type Default Details
X_train Union training data/glycans
X_test Union test data/glycans
y_train List training labels
y_test List test labels
mode str classification ‘classification’ or ‘regression’
feature_calc bool False calculate motifs from glycans
return_features bool False return calculated features
feature_set List [‘known’, ‘exhaustive’] feature set for annotations
additional_features_train Optional None additional training features
additional_features_test Optional None additional test features
Returns Union trained model and optionally features
human = [1 if k == 'Homo_sapiens' else 0 for k in df_species[df_species.Order=='Primates'].Species.values.tolist()]
X_train, X_test, y_train, y_test = general_split(df_species[df_species.Order=='Primates'].glycan.values.tolist(), human)
model_ft, _, X_test = train_ml_model(X_train, X_test, y_train, y_test, feature_calc = True, feature_set = ['terminal'],
                         return_features = True)

You provided glycans without features but did not specify feature_calc; we'll step in and calculate features with the default feature_set but feel free to re-run and change.

Calculating Glycan Features...

Training model...

Evaluating model...
Accuracy of trained model on separate validation set: 0.8722222222222222

analyze_ml_model

 analyze_ml_model (model:xgboost.sklearn.XGBModel)

plots relevant features for model prediction

Type Details
model XGBModel trained ML model from train_ml_model
Returns None
analyze_ml_model(model_ft)


get_mismatch

 get_mismatch (model:xgboost.sklearn.XGBModel,
               X_test:pandas.core.frame.DataFrame, y_test:List, n:int=10)

analyzes misclassifications of trained machine learning model

Type Default Details
model XGBModel trained ML model from train_ml_model
X_test DataFrame motif dataframe for validation
y_test List test labels
n int 10 number of returned misclassifications
Returns List misclassifications and predicted probabilities
get_mismatch(model_ft, X_test, y_test)
[('Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]Gal(b1-4)Glc-ol', 0.8661944270133972),
 ('Man(a1-2)Man(a1-2)Man(a1-3)[Man(a1-2)Man(a1-3)[Man(a1-2)Man(a1-6)]Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc',
  0.814888596534729),
 ('Neu5Ac(a2-8)Neu5Ac(a2-3)Gal(b1-4)Glc1Cer', 0.8540824055671692),
 ('Gal(b1-4)Glc-ol', 0.7748590111732483),
 ('Gal(b1-4)GlcNAc(b1-2)[Gal(b1-4)GlcNAc(b1-?)]Man(a1-?)[GlcNAc(b1-2)Man(a1-?)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc',
  0.9565786123275757),
 ('Gal(b1-3)GlcNAc(b1-3)Gal(b1-4)Glc-ol', 0.812296450138092),
 ('Fuc(a1-4)GlcNAc(b1-3)Gal(b1-4)[Fuc(a1-3)]Glc-ol', 0.8908309936523438),
 ('Neu5Ac(a2-6)Gal(b1-?)[Fuc(a1-?)]GlcNAc(b1-?)[Fuc(a1-?)[Gal(b1-?)]GlcNAc(b1-?)]Man(a1-3)[Gal(b1-?)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc',
  0.612709641456604),
 ('Neu5Ac(a2-3)Gal(b1-?)GlcNAc(b1-2)Man(a1-3)[Neu5Ac(a2-3)Gal(b1-?)[Neu5Ac(a2-6)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc',
  0.9313767552375793),
 ('GalNAc(b1-4)Gal(b1-4)GlcNAc(b1-6)[GalNAc(a1-3)Gal(b1-3)]GalNAc',
  0.9005035161972046)]
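
The shape of this output, a list of (glycan, predicted probability) tuples, can be reproduced with a small sketch. It assumes a binary task with a 0.5 decision threshold and simply keeps the first `n` mismatches; how get_mismatch actually selects its examples is not stated here, and `find_mismatches` is a hypothetical helper. The probabilities below are toy values.

```python
from typing import List, Tuple


def find_mismatches(glycans: List[str], probs: List[float], labels: List[int],
                    n: int = 10, threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return up to n (glycan, probability) pairs where prediction and label disagree."""
    mismatches = []
    for glycan, prob, label in zip(glycans, probs, labels):
        predicted = 1 if prob >= threshold else 0
        if predicted != label:
            mismatches.append((glycan, prob))
    return mismatches[:n]


# Toy inputs, made up for the example.
glycans = ["Gal(b1-4)Glc-ol", "Fuc(a1-2)Gal(b1-4)Glc-ol", "Neu5Ac(a2-3)Gal(b1-4)Glc"]
probs = [0.77, 0.10, 0.35]
labels = [0, 0, 1]
wrong = find_mismatches(glycans, probs, labels)
```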

models

describes some examples of machine learning architectures applicable to glycans. The main entry point is prep_model, which allows users to set up (trained) models by their string names


SweetNet

 SweetNet (lib_size:int, num_classes:int=1, hidden_dim:int=128)

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing them to be nested in a tree structure. You can assign the submodules as regular attributes:

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will also have their parameters converted when you call to(), etc.

Note: As per the example above, an __init__() call to the parent class must be made before assignment on the child.

training (bool): whether this module is in training or evaluation mode.
Type Default Details
lib_size int number of unique tokens for graph nodes
num_classes int 1 number of output classes (>1 for multilabel)
hidden_dim int 128 dimension of hidden layers
Returns None

LectinOracle

 LectinOracle (input_size_glyco:int, hidden_size:int=128,
               num_classes:int=1, data_min:float=-11.355,
               data_max:float=23.892, input_size_prot:int=1280)

Type Default Details
input_size_glyco int number of unique tokens for graph nodes
hidden_size int 128 layer size for graph convolutions
num_classes int 1 number of output classes (>1 for multilabel)
data_min float -11.355 minimum observed value in training data
data_max float 23.892 maximum observed value in training data
input_size_prot int 1280 dimensionality of protein representations
Returns None
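
One plausible use of `data_min` and `data_max` is to bound a regression output to the range observed in training. The sketch below shows that idea with a scaled sigmoid; whether LectinOracle applies exactly this transformation is an assumption, and `sigmoid_range` is written here only for illustration.

```python
import math


def sigmoid_range(x: float, low: float = -11.355, high: float = 23.892) -> float:
    """Squash an unbounded value into [low, high] with a scaled sigmoid (illustrative)."""
    return low + (high - low) / (1.0 + math.exp(-x))


# A raw output of 0 maps to the midpoint of the observed range.
midpoint = sigmoid_range(0.0)
```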

LectinOracle_flex

 LectinOracle_flex (input_size_glyco:int, hidden_size:int=128,
                    num_classes:int=1, data_min:float=-11.355,
                    data_max:float=23.892, input_size_prot:int=1000)

Type Default Details
input_size_glyco int number of unique tokens for graph nodes
hidden_size int 128 layer size for graph convolutions
num_classes int 1 number of output classes (>1 for multilabel)
data_min float -11.355 minimum observed value in training data
data_max float 23.892 maximum observed value in training data
input_size_prot int 1000 maximum protein sequence length for padding/cutting
Returns None
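
The "padding/cutting" mentioned for `input_size_prot` can be illustrated with a small helper that brings every protein sequence to a fixed length. This is a sketch under assumptions: `pad_or_cut` is a hypothetical name, and the pad character 'z' is borrowed from the N-sequon convention in this documentation rather than confirmed for LectinOracle_flex.

```python
def pad_or_cut(seq: str, max_len: int = 1000, pad_char: str = "z") -> str:
    """Truncate seq to max_len, or right-pad it with pad_char up to max_len."""
    return seq[:max_len].ljust(max_len, pad_char)


short = pad_or_cut("MKT", max_len=5)        # padded on the right
long = pad_or_cut("MKTAYIAK", max_len=5)    # cut to the first five residues
```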

NSequonPred

 NSequonPred ()



init_weights

 init_weights (model:torch.nn.modules.module.Module, mode:str='sparse',
               sparsity:float=0.1)

initializes linear layers of PyTorch model with a weight initialization

Type Default Details
model Module neural network for analyzing glycans
mode str sparse initialization algorithm: ‘sparse’, ‘kaiming’, ‘xavier’
sparsity float 0.1 proportion of sparsity after initialization
Returns None

prep_model

 prep_model (model_type:Literal['SweetNet','LectinOracle',
             'LectinOracle_flex','NSequonPred'], num_classes:int,
             libr:Optional[Dict[str,int]]=None, trained:bool=False,
             hidden_dim:int=128)

wrapper to instantiate model, initialize it, and put it on the GPU

Type Default Details
model_type Literal type of model to create
num_classes int number of unique classes for classification
libr Optional None dictionary of form glycoletter:index
trained bool False whether to use pretrained model
hidden_dim int 128 hidden dimension for the model (SweetNet/LectinOracle only)
Returns Module initialized PyTorch model

processing

contains helper functions to prepare glycan data for model training


dataset_to_graphs

 dataset_to_graphs (glycan_list:List[str], labels:List[Union[float,int]],
                    libr:Optional[Dict[str,int]]=None,
                    label_type:torch.dtype=torch.int64)

wrapper function to convert a whole list of glycans into a graph dataset

Type Default Details
glycan_list List list of IUPAC-condensed glycan sequences
labels List list of labels
libr Optional None dictionary of glycoletter:index
label_type dtype torch.int64 tensor type for label
Returns List list of node/edge/label data tuples
dataset_to_graphs(["Neu5Ac(a2-3)Gal(b1-4)Glc",
                  "Fuc(a1-2)Gal(b1-3)GalNAc"], [1, 0])
[Data(edge_index=[2, 8], labels=[5], string_labels=[5], num_nodes=5, y=1),
 Data(edge_index=[2, 8], labels=[5], string_labels=[5], num_nodes=5, y=0)]
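
Each trisaccharide above yields five nodes because monosaccharides and linkages are both treated as graph nodes, and `libr` maps each such glycoletter to an integer index. The following sketch shows that mapping for simple sequences. It is an assumption-laden illustration: `tokenize_glycan` and `build_library` are hypothetical helpers, the alphabetical index order is arbitrary, and floating parts in curly braces are not handled.

```python
import re
from typing import Dict, List


def tokenize_glycan(glycan: str) -> List[str]:
    """Split an IUPAC-condensed glycan into monosaccharide and linkage tokens."""
    # Split on parentheses and square brackets; drop empty pieces.
    return [tok for tok in re.split(r"[()\[\]]", glycan) if tok]


def build_library(glycans: List[str]) -> Dict[str, int]:
    """Map every unique glycoletter to an index (alphabetical order, by assumption)."""
    letters = sorted({tok for g in glycans for tok in tokenize_glycan(g)})
    return {letter: idx for idx, letter in enumerate(letters)}


glycans = ["Neu5Ac(a2-3)Gal(b1-4)Glc", "Fuc(a1-2)Gal(b1-3)GalNAc"]
libr = build_library(glycans)
encoded = [[libr[tok] for tok in tokenize_glycan(g)] for g in glycans]
```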

dataset_to_dataloader

 dataset_to_dataloader (glycan_list:List[str],
                        labels:List[Union[float,int]],
                        libr:Optional[Dict[str,int]]=None,
                        batch_size:int=32, shuffle:bool=True,
                        drop_last:bool=False,
                        extra_feature:Optional[List[float]]=None,
                        label_type:torch.dtype=torch.int64,
                        augment_prob:float=0.0,
                        generalization_prob:float=0.2)

wrapper function to convert glycans and labels to a torch_geometric DataLoader

Type Default Details
glycan_list List list of IUPAC-condensed glycans
labels List list of labels
libr Optional None dictionary of glycoletter:index
batch_size int 32 samples per batch
shuffle bool True shuffle samples in dataloader
drop_last bool False drop last batch
extra_feature Optional None additional input features
label_type dtype torch.int64 tensor type for label
augment_prob float 0.0 probability of data augmentation
generalization_prob float 0.2 probability of wildcarding
Returns DataLoader dataloader for training
next(iter(dataset_to_dataloader(["Neu5Ac(a2-3)Gal(b1-4)Glc",
                                 "Fuc(a1-2)Gal(b1-3)GalNAc"], [1, 0])))
DataBatch(edge_index=[2, 16], labels=[10], string_labels=[2], num_nodes=10, y=[2], batch=[10], ptr=[3])
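
The idea behind `generalization_prob` (wildcarding during training) can be sketched as randomly masking linkage positions with '?', which mirrors the uncertain linkages such as 'b1-?' that appear in glycan databases. Exactly what glycowork replaces, and how, is an assumption here; `wildcard_linkages` is a hypothetical helper.

```python
import random
import re

# Matches linkages such as (a2-3), (b1-4), or (?1-?).
LINKAGE = re.compile(r"\(([ab?])(\d)-(\d|\?)\)")


def wildcard_linkages(glycan: str, prob: float, rng: random.Random) -> str:
    """With probability prob, replace the target position of each linkage by '?'."""
    def maybe_mask(match: re.Match) -> str:
        anomer, start, _ = match.groups()
        if rng.random() < prob:
            return f"({anomer}{start}-?)"
        return match.group(0)

    return LINKAGE.sub(maybe_mask, glycan)


rng = random.Random(0)
masked = wildcard_linkages("Neu5Ac(a2-3)Gal(b1-4)Glc", prob=1.0, rng=rng)
untouched = wildcard_linkages("Neu5Ac(a2-3)Gal(b1-4)Glc", prob=0.0, rng=rng)
```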

split_data_to_train

 split_data_to_train (glycan_list_train:List[str],
                      glycan_list_val:List[str],
                      labels_train:List[Union[float,int]],
                      labels_val:List[Union[float,int]],
                      libr:Optional[Dict[str,int]]=None,
                      batch_size:int=32, drop_last:bool=False,
                      extra_feature_train:Optional[List[float]]=None,
                      extra_feature_val:Optional[List[float]]=None,
                      label_type:torch.dtype=torch.int64,
                      augment_prob:float=0.0,
                      generalization_prob:float=0.2)

wrapper function to convert split training/test data into dictionary of dataloaders

Type Default Details
glycan_list_train List training glycans
glycan_list_val List validation glycans
labels_train List training labels
labels_val List validation labels
libr Optional None dictionary of glycoletter:index
batch_size int 32 samples per batch
drop_last bool False drop last batch
extra_feature_train Optional None additional training features
extra_feature_val Optional None additional validation features
label_type dtype torch.int64 tensor type for label
augment_prob float 0.0 probability of data augmentation
generalization_prob float 0.2 probability of wildcarding
Returns Dict dictionary of train/val dataloaders
split_data_to_train(["Neu5Ac(a2-3)Gal(b1-4)Glc", "Fuc(a1-2)Gal(b1-3)GalNAc"],
                    ["Neu5Ac(a2-6)Gal(b1-4)Glc", "Fuc(a1-2)Gal(a1-3)GalNAc"],
                    [1, 0], [0,1])
{'train': <torch_geometric.loader.dataloader.DataLoader>,
 'val': <torch_geometric.loader.dataloader.DataLoader>}

inference

can be used to analyze trained models, make predictions, or obtain glycan representations


glycans_to_emb

 glycans_to_emb (glycans:List[str], model:torch.nn.modules.module.Module,
                 libr:Optional[Dict[str,int]]=None, batch_size:int=32,
                 rep:bool=True, class_list:Optional[List[str]]=None)

Returns a dataframe of learned representations for a list of glycans

Type Default Details
glycans List list of glycans in IUPAC-condensed
model Module trained graph neural network for analyzing glycans
libr Optional None dictionary of form glycoletter:index
batch_size int 32 batch size used during training
rep bool True True returns representations, False returns predicted labels
class_list Optional None list of unique classes to map predictions
Returns Union dataframe of representations or list of predictions

get_lectin_preds

 get_lectin_preds (prot:str, glycans:List[str],
                   model:torch.nn.modules.module.Module,
                   prot_dic:Optional[Dict[str,List[float]]]=None,
                   background_correction:bool=False,
                   correction_df:Optional[pandas.core.frame.DataFrame]=None,
                   batch_size:int=128, libr:Optional[Dict[str,int]]=None,
                   sort:bool=True, flex:bool=False)

Wrapper that uses LectinOracle-type model for predicting binding of protein to glycans

Type Default Details
prot str protein amino acid sequence
glycans List list of glycans in IUPAC-condensed
model Module trained LectinOracle-type model
prot_dic Optional None dict of protein sequence:ESM1b representation
background_correction bool False whether to correct predictions for background
correction_df Optional None background prediction for glycans
batch_size int 128 batch size used during training
libr Optional None dict of glycoletter:index
sort bool True whether to sort prediction results in descending order
flex bool False LectinOracle (False) or LectinOracle_flex (True)
Returns DataFrame glycan sequences and predicted binding

get_Nsequon_preds

 get_Nsequon_preds (prots:List[str], model:torch.nn.modules.module.Module,
                    prot_dic:Dict[str,List[float]])

Predicts whether an N-sequon will be glycosylated

Type Details
prots List 20 AA + N + 20 AA sequences; replace missing with ‘z’
model Module trained NSequonPred-type model
prot_dic Dict dict of protein sequence:ESM1b representation
Returns DataFrame protein sequences and predicted likelihood
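
The expected input format (20 AA + N + 20 AA, with missing positions replaced by 'z') can be prepared with a helper along these lines. The sketch uses the standard N-sequon definition N-X-S/T with X not proline; `sequon_windows` is a hypothetical helper and not part of glycowork.

```python
import re
from typing import List


def sequon_windows(protein: str, flank: int = 20, pad: str = "z") -> List[str]:
    """Return one window of flank + N + flank residues per N-X-S/T sequon (X != P)."""
    padded = pad * flank + protein + pad * flank
    windows = []
    # Lookahead so that overlapping sequons are all found.
    for match in re.finditer(r"(?=N[^P][ST])", protein):
        center = match.start() + flank  # index of the N in the padded string
        windows.append(padded[center - flank:center + flank + 1])
    return windows


# Toy sequence, made up for the example.
windows = sequon_windows("MKNASGT")
```

Each window has length 41 with the asparagine at index 20, so sequons close to either terminus are padded with 'z'.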

get_esm1b_representations

 get_esm1b_representations (prots:List[str],
                            model:torch.nn.modules.module.Module,
                            alphabet:Any)

Retrieves ESM1b representations of proteins for use as input to LectinOracle

Type Details
prots List list of protein sequences to convert
model Module trained ESM1b model
alphabet Any used for converting sequences
Returns Dict dict of protein sequence:ESM1b representation

In order to run get_esm1b_representations, you first have to run this snippet:

!pip install fair-esm
import esm
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()

train_test_split

contains various data split functions to get appropriate training and test sets


hierarchy_filter

 hierarchy_filter (df_in:pandas.core.frame.DataFrame, rank:str='Domain',
                   min_seq:int=5, wildcard_seed:bool=False,
                   wildcard_list:Optional[List[str]]=None,
                   wildcard_name:Optional[str]=None, r:float=0.1,
                   col:str='glycan')

stratified data split in train/test at the taxonomic level, removing duplicate glycans and infrequent classes

Type Default Details
df_in DataFrame dataframe of glycan sequences and taxonomic labels
rank str Domain taxonomic rank to filter
min_seq int 5 minimum glycans per class
wildcard_seed bool False seed wildcard glycoletters
wildcard_list Optional None glycoletters for wildcard
wildcard_name Optional None wildcard name in IUPAC
r float 0.1 replacement rate
col str glycan column name for glycans
Returns Tuple train/val splits and mappings
train_x, val_x, train_y, val_y, id_val, class_list, class_converter = hierarchy_filter(df_species,
                                                                                       rank = 'Kingdom')
print(train_x[:10])
['Neu5Ac(a2-8)Neu5Ac(a2-6)[Gal(?1-?)GalNAc(a1-3)]GalNAc', 'Man(a1-2)Man(a1-2)Man(b1-3)GlcNAc(a1-6)Man', 'GlcNAc(b1-2)Man(a1-6)[Man(a1-3)][GlcNAc(b1-4)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'GalNAcOS(b1-4)GlcNAc(b1-2)Man(a1-3)[GalNAcOS(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-4)[Gal(b1-4)GlcNAc(b1-2)]Man(a1-3)[Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-?)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-2)Man(a1-?)[Fuc(a1-3)[Gal(b1-4)]GlcNAc(b1-2)Man(a1-?)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', '{Gal(b1-4)GlcNAc(b1-?)}{Gal(b1-4)GlcNAc(b1-?)}{Gal(b1-4)GlcNAc(b1-?)}{Neu5Ac(a2-?)}{Neu5Ac(a2-?)}Gal(b1-4)GlcNAc(b1-2)[Gal(b1-4)GlcNAc(b1-4)]Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)[Gal(b1-4)GlcNAc(b1-6)]Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'Glc(a1-4)Glc(a1-4)[Glc(a1-6)Glc(a1-6)]Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc', 'Fuc(a1-?)Hex(?1-?)[Hex(?1-?)]GalNAc', 'Neu5Ac(a2-?)GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc']
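
The filtering part of hierarchy_filter (removing duplicate glycans and classes with fewer than `min_seq` glycans, then encoding classes as integers) can be sketched in plain Python. This is an illustration only: `filter_classes` is a hypothetical helper, the stratified train/validation split and wildcard seeding are omitted, and the toy labels are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple


def filter_classes(pairs: List[Tuple[str, str]],
                   min_seq: int = 5) -> Tuple[List[str], List[int], Dict[str, int]]:
    """Deduplicate (glycan, class) pairs, drop rare classes, and index the rest."""
    unique_pairs = sorted(set(pairs))
    counts = Counter(label for _, label in unique_pairs)
    kept = [(g, label) for g, label in unique_pairs if counts[label] >= min_seq]
    class_list = sorted({label for _, label in kept})
    class_converter = {label: idx for idx, label in enumerate(class_list)}
    glycans = [g for g, _ in kept]
    labels = [class_converter[label] for _, label in kept]
    return glycans, labels, class_converter


# Toy (glycan, Kingdom) pairs, made up for the example.
pairs = [("Gal(b1-4)Glc", "Animalia"), ("Gal(b1-4)Glc", "Animalia"),
         ("Fuc(a1-2)Gal", "Animalia"), ("Man(a1-2)Man", "Fungi")]
glycans, labels, class_converter = filter_classes(pairs, min_seq=2)
```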

general_split

 general_split (glycans:List[str], labels:List[Union[float,int,str]],
                test_size:float=0.2)

splits glycans and labels into train / test sets

Type Default Details
glycans List list of IUPAC-condensed glycans
labels List list of prediction labels
test_size float 0.2 size of test set
Returns Tuple train/test splits
train_x, val_x, train_y, val_y = general_split(df_species.glycan.values.tolist(),
                                              df_species.Species.values.tolist())
print(train_x[:10])
['GlcNAc(b1-2)[GlcNAc(b1-4)]Man(a1-3)[GlcNAc(b1-2)Man(a1-6)][GlcNAc(b1-4)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-?)GlcNAc6S(b1-6)[Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-3)]GalNAc', 'Neu5Ac(a2-?)GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)[Neu5Ac(a2-?)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'Glc(a1-4)Glc(a1-4)[Glc(a1-4)Glc(a1-4)Glc(a1-6)]Glc(a1-4)Glc', 'Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Fuc(a1-2)Gal(b1-4)Glc-ol', 'Neu5Gc(a2-8)Neu5Ac(a2-6)[Gal(b1-3)]GalNAc', 'Glc(b1-4)Rha(b1-3)[Glc(a1-6)]Gal', 'Glc6Ac(b1-2)Glc1Ole6Ac(b1-4)Glc6Ac', 'Neu5Ac(a2-3)[GalNAc(b1-4)]Gal(b1-4)Glc-ol']
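
A random split with this return order (train_x, test_x, train_y, test_y) can be sketched with the standard library alone. The helper `simple_split`, its `seed` argument, and the rounding of the test-set size are assumptions for illustration; general_split itself may rely on a library routine with different rounding.

```python
import random
from typing import List, Sequence, Tuple


def simple_split(glycans: Sequence[str], labels: Sequence, test_size: float = 0.2,
                 seed: int = 42) -> Tuple[List[str], List[str], List, List]:
    """Shuffle indices once, then cut them into a test part and a train part."""
    indices = list(range(len(glycans)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [glycans[i] for i in train_idx]
    test_x = [glycans[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y


# Toy data: ten placeholder glycan names with matching labels.
toy_glycans = [f"glycan_{i}" for i in range(10)]
toy_labels = list(range(10))
train_x, test_x, train_y, test_y = simple_split(toy_glycans, toy_labels)
```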

prepare_multilabel

 prepare_multilabel (df:pandas.core.frame.DataFrame, rank:str='Species',
                     glycan_col:str='glycan')

converts a file with one glycan-species/tissue/disease association per row into a format with one entry per glycan that lists all of its associations

Type Default Details
df DataFrame dataframe with one glycan-association per row
rank str Species label column to use
glycan_col str glycan column with glycan sequences
Returns Tuple unique glycans and their label vectors
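
The conversion from one association per row to one binary label vector per glycan can be sketched as follows. This is a plain-Python illustration that works on (glycan, label) tuples instead of a dataframe; `to_multilabel` is a hypothetical helper, and returning the class order as a third value is a choice made for the example.

```python
from typing import Dict, List, Tuple


def to_multilabel(rows: List[Tuple[str, str]]) -> Tuple[List[str], List[List[int]], List[str]]:
    """Collapse (glycan, label) rows into unique glycans and binary label vectors."""
    classes = sorted({label for _, label in rows})
    column = {label: idx for idx, label in enumerate(classes)}
    vectors: Dict[str, List[int]] = {}
    for glycan, label in rows:
        # Create an all-zero vector the first time a glycan is seen.
        vec = vectors.setdefault(glycan, [0] * len(classes))
        vec[column[label]] = 1
    glycans = list(vectors)  # insertion order: first appearance in rows
    return glycans, [vectors[g] for g in glycans], classes


# Toy associations, made up for the example.
rows = [("Gal(b1-4)Glc", "Homo_sapiens"),
        ("Gal(b1-4)Glc", "Mus_musculus"),
        ("Fuc(a1-2)Gal", "Homo_sapiens")]
unique_glycans, label_vectors, classes = to_multilabel(rows)
```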