ml

ml contains the code base to process glycans for machine learning, construct state-of-the-art machine learning models, train them, and analyze trained models and glycan representations. It currently contains the following modules:

model_training contains functions for training machine learning models
models describes some examples for machine learning architectures applicable to glycans
processing contains helper functions to prepare glycan data for model training
inference can be used to analyze trained models, make predictions, or obtain glycan representations
train_test_split contains various data split functions to get appropriate training and test sets

model_training

contains functions for training machine learning models

EarlyStopping

 EarlyStopping (patience:int=7, verbose:bool=False)

Stops training early if the validation loss doesn’t improve within a given patience

	Type	Default	Details
patience	int	7	epochs to wait after last improvement
verbose	bool	False	whether to print messages
Returns	None
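
The patience logic described above can be illustrated with a small, framework-free sketch. The class name `PatienceStopper` and its attributes are hypothetical and chosen only for illustration; this is not the library's implementation, which may also handle details such as model checkpointing.

```python
class PatienceStopper:
    """Hypothetical stand-in that illustrates patience-based early stopping."""

    def __init__(self, patience=7, verbose=False):
        self.patience = patience
        self.verbose = verbose
        self.best_loss = float("inf")
        self.counter = 0          # epochs since the last improvement
        self.early_stop = False

    def __call__(self, val_loss):
        if val_loss < self.best_loss:
            # Improvement: remember the new best loss and reset the counter.
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.verbose:
                print(f"no improvement for {self.counter}/{self.patience} epochs")
            if self.counter >= self.patience:
                self.early_stop = True


stopper = PatienceStopper(patience=3)
val_losses = [0.9, 0.7, 0.71, 0.72, 0.70, 0.69]  # made-up validation losses
epochs_run = 0
for loss in val_losses:
    epochs_run += 1
    stopper(loss)
    if stopper.early_stop:
        break
```

In this toy run the loop ends before the last loss value is seen, because three consecutive epochs fail to beat the best loss of 0.7.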

train_model

 train_model (model:torch.nn.modules.module.Module,
              dataloaders:Dict[str,torch.utils.data.dataloader.DataLoader]
              , criterion:torch.nn.modules.module.Module,
              optimizer:torch.optim.optimizer.Optimizer,
              scheduler:torch.optim.lr_scheduler._LRScheduler,
              num_epochs:int=25, patience:int=50,
              mode:str='classification', mode2:str='multi',
              return_metrics:bool=False)

trains a deep learning model on predicting glycan properties

	Type	Default	Details
model	Module		graph neural network for analyzing glycans
dataloaders	Dict		dict with ‘train’ and ‘val’ loaders
criterion	Module		PyTorch loss function
optimizer	Optimizer		PyTorch optimizer, has to be SAM if mode != “regression”
scheduler	_LRScheduler		PyTorch learning rate decay
num_epochs	int	25	number of epochs for training
patience	int	50	epochs without improvement until early stop
mode	str	classification	‘classification’, ‘multilabel’, or ‘regression’
mode2	str	multi	‘multi’ or ‘binary’ classification
return_metrics	bool	False	whether to return metrics
Returns	Union		best model from training and the training and validation metrics

training_setup

 training_setup (model:torch.nn.modules.module.Module, lr:float,
                 lr_patience:int=4, factor:float=0.2,
                 weight_decay:float=0.0001, mode:str='multiclass',
                 num_classes:int=2, gsam_alpha:float=0.0)

prepares optimizer, learning rate scheduler, and loss criterion for model training

	Type	Default	Details
model	Module		graph neural network for analyzing glycans
lr	float		learning rate
lr_patience	int	4	epochs before reducing learning rate
factor	float	0.2	factor to multiply lr on reduction
weight_decay	float	0.0001	regularization parameter
mode	str	multiclass	type of prediction task
num_classes	int	2	number of classes for classification
gsam_alpha	float	0.0	if >0, uses GSAM instead of SAM optimizer
Returns	Tuple		optimizer, scheduler, criterion
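
The interplay of lr_patience and factor can be sketched without PyTorch. The function `plateau_schedule` below is a hypothetical illustration of a reduce-on-plateau rule; the exact scheduler that training_setup constructs, and the precise epoch at which it triggers, are assumptions here.

```python
def plateau_schedule(val_losses, lr, lr_patience=4, factor=0.2):
    """Return the learning rate in effect after each epoch (illustrative only)."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Assumption: reduce once the stall lasts longer than lr_patience epochs.
        if bad_epochs > lr_patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up validation losses that stall at 0.9 for several epochs.
rates = plateau_schedule([1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8],
                         lr=0.01, lr_patience=4, factor=0.2)
```

With these toy values the learning rate stays at 0.01 for six epochs and is multiplied by 0.2 on the seventh.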

train_ml_model

 train_ml_model (X_train:Union[pandas.core.frame.DataFrame,List],
                 X_test:Union[pandas.core.frame.DataFrame,List],
                 y_train:List, y_test:List, mode:str='classification',
                 feature_calc:bool=False, return_features:bool=False,
                 feature_set:List[str]=['known', 'exhaustive'],
                 additional_features_train:Optional[pandas.core.frame.DataFrame]=None,
                 additional_features_test:Optional[pandas.core.frame.DataFrame]=None)

wrapper function to train standard machine learning models on glycans

	Type	Default	Details
X_train	Union		training data/glycans
X_test	Union		test data/glycans
y_train	List		training labels
y_test	List		test labels
mode	str	classification	‘classification’ or ‘regression’
feature_calc	bool	False	calculate motifs from glycans
return_features	bool	False	return calculated features
feature_set	List	[‘known’, ‘exhaustive’]	feature set for annotations
additional_features_train	Optional	None	additional training features
additional_features_test	Optional	None	additional test features
Returns	Union		trained model and optionally features

human = [1 if k == 'Homo_sapiens' else 0 for k in df_species[df_species.Order=='Primates'].Species.values.tolist()]
X_train, X_test, y_train, y_test = general_split(df_species[df_species.Order=='Primates'].glycan.values.tolist(), human)
model_ft, _, X_test = train_ml_model(X_train, X_test, y_train, y_test, feature_calc = True, feature_set = ['terminal'],
                         return_features = True)


You provided glycans without features but did not specify feature_calc; we'll step in and calculate features with the default feature_set but feel free to re-run and change.

Calculating Glycan Features...

Training model...

Evaluating model...
Accuracy of trained model on separate validation set: 0.8722222222222222

analyze_ml_model

 analyze_ml_model (model:xgboost.sklearn.XGBModel)

plots relevant features for model prediction

	Type	Details
model	XGBModel	trained ML model from train_ml_model
Returns	None

analyze_ml_model(model_ft)

get_mismatch

 get_mismatch (model:xgboost.sklearn.XGBModel,
               X_test:pandas.core.frame.DataFrame, y_test:List, n:int=10)

analyzes misclassifications of trained machine learning model

	Type	Default	Details
model	XGBModel		trained ML model from train_ml_model
X_test	DataFrame		motif dataframe for validation
y_test	List		test labels
n	int	10	number of returned misclassifications
Returns	List		misclassifications and predicted probabilities

get_mismatch(model_ft, X_test, y_test)

[('Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]Gal(b1-4)Glc-ol', 0.8661944270133972),
 ('Man(a1-2)Man(a1-2)Man(a1-3)[Man(a1-2)Man(a1-3)[Man(a1-2)Man(a1-6)]Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc',
  0.814888596534729),
 ('Neu5Ac(a2-8)Neu5Ac(a2-3)Gal(b1-4)Glc1Cer', 0.8540824055671692),
 ('Gal(b1-4)Glc-ol', 0.7748590111732483),
 ('Gal(b1-4)GlcNAc(b1-2)[Gal(b1-4)GlcNAc(b1-?)]Man(a1-?)[GlcNAc(b1-2)Man(a1-?)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc',
  0.9565786123275757),
 ('Gal(b1-3)GlcNAc(b1-3)Gal(b1-4)Glc-ol', 0.812296450138092),
 ('Fuc(a1-4)GlcNAc(b1-3)Gal(b1-4)[Fuc(a1-3)]Glc-ol', 0.8908309936523438),
 ('Neu5Ac(a2-6)Gal(b1-?)[Fuc(a1-?)]GlcNAc(b1-?)[Fuc(a1-?)[Gal(b1-?)]GlcNAc(b1-?)]Man(a1-3)[Gal(b1-?)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc',
  0.612709641456604),
 ('Neu5Ac(a2-3)Gal(b1-?)GlcNAc(b1-2)Man(a1-3)[Neu5Ac(a2-3)Gal(b1-?)[Neu5Ac(a2-6)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc',
  0.9313767552375793),
 ('GalNAc(b1-4)Gal(b1-4)GlcNAc(b1-6)[GalNAc(a1-3)Gal(b1-3)]GalNAc',
  0.9005035161972046)]

models

describes some examples for machine learning architectures applicable to glycans. The main entry point is prep_model, which allows users to set up (trained) models by their string names

SweetNet

 SweetNet (lib_size:int, num_classes:int=1, hidden_dim:int=128)

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing them to be nested in a tree structure. You can assign the submodules as regular attributes:

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will also have their parameters converted when you call to(), etc.

Note: As per the example above, an __init__() call to the parent class must be made before assignment on the child.

training (bool): whether this module is in training or evaluation mode.

	Type	Default	Details
lib_size	int		number of unique tokens for graph nodes
num_classes	int	1	number of output classes (>1 for multilabel)
hidden_dim	int	128	dimension of hidden layers
Returns	None

LectinOracle

 LectinOracle (input_size_glyco:int, hidden_size:int=128,
               num_classes:int=1, data_min:float=-11.355,
               data_max:float=23.892, input_size_prot:int=1280)

	Type	Default	Details
input_size_glyco	int		number of unique tokens for graph nodes
hidden_size	int	128	layer size for graph convolutions
num_classes	int	1	number of output classes (>1 for multilabel)
data_min	float	-11.355	minimum observed value in training data
data_max	float	23.892	maximum observed value in training data
input_size_prot	int	1280	dimensionality of protein representations
Returns	None

LectinOracle_flex

 LectinOracle_flex (input_size_glyco:int, hidden_size:int=128,
                    num_classes:int=1, data_min:float=-11.355,
                    data_max:float=23.892, input_size_prot:int=1000)

	Type	Default	Details
input_size_glyco	int		number of unique tokens for graph nodes
hidden_size	int	128	layer size for graph convolutions
num_classes	int	1	number of output classes (>1 for multilabel)
data_min	float	-11.355	minimum observed value in training data
data_max	float	23.892	maximum observed value in training data
input_size_prot	int	1000	maximum protein sequence length for padding/cutting
Returns	None

NSequonPred

 NSequonPred ()

init_weights

 init_weights (model:torch.nn.modules.module.Module, mode:str='sparse',
               sparsity:float=0.1)

initializes linear layers of PyTorch model with a weight initialization

	Type	Default	Details
model	Module		neural network for analyzing glycans
mode	str	sparse	initialization algorithm: ‘sparse’, ‘kaiming’, ‘xavier’
sparsity	float	0.1	proportion of sparsity after initialization
Returns	None

prep_model

 prep_model (model_type:Literal['SweetNet','LectinOracle','LectinOracle_flex','NSequonPred'],
             num_classes:int, libr:Optional[Dict[str,int]]=None,
             trained:bool=False, hidden_dim:int=128)

wrapper to instantiate model, initialize it, and put it on the GPU

	Type	Default	Details
model_type	Literal		type of model to create
num_classes	int		number of unique classes for classification
libr	Optional	None	dictionary of form glycoletter:index
trained	bool	False	whether to use pretrained model
hidden_dim	int	128	hidden dimension for the model (SweetNet/LectinOracle only)
Returns	Module		initialized PyTorch model

processing

contains helper functions to prepare glycan data for model training

dataset_to_graphs

 dataset_to_graphs (glycan_list:List[str], labels:List[Union[float,int]],
                    libr:Optional[Dict[str,int]]=None,
                    label_type:torch.dtype=torch.int64)

wrapper function to convert a whole list of glycans into a graph dataset

	Type	Default	Details
glycan_list	List		list of IUPAC-condensed glycan sequences
labels	List		list of labels
libr	Optional	None	dictionary of glycoletter:index
label_type	dtype	torch.int64	tensor type for label
Returns	List		list of node/edge/label data tuples

dataset_to_graphs(["Neu5Ac(a2-3)Gal(b1-4)Glc",
                  "Fuc(a1-2)Gal(b1-3)GalNAc"], [1, 0])

[Data(edge_index=[2, 8], labels=[5], string_labels=[5], num_nodes=5, y=1),
 Data(edge_index=[2, 8], labels=[5], string_labels=[5], num_nodes=5, y=0)]
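
The shapes in this output (5 nodes and 8 directed edges per trisaccharide) are consistent with treating every monosaccharide and every linkage as a node and connecting neighbors in both directions. The sketch below reproduces that idea for unbranched glycans only. The function `linear_glycan_to_graph` and its dictionary output are hypothetical; the library's own conversion handles branching and may order or index nodes differently.

```python
import re


def linear_glycan_to_graph(glycan, libr):
    """Tokenize an unbranched IUPAC-condensed glycan and build a path graph."""
    # Splitting on parentheses alternates monosaccharides and linkages.
    tokens = [t for t in re.split(r"[()]", glycan) if t]
    # Extend the glycoletter:index dictionary with tokens not seen before.
    for tok in tokens:
        libr.setdefault(tok, len(libr))
    labels = [libr[t] for t in tokens]
    # Connect neighboring nodes in both directions (undirected graph).
    sources, targets = [], []
    for i in range(len(tokens) - 1):
        sources += [i, i + 1]
        targets += [i + 1, i]
    return {"string_labels": tokens, "labels": labels,
            "edge_index": [sources, targets]}


libr = {}
graph = linear_glycan_to_graph("Neu5Ac(a2-3)Gal(b1-4)Glc", libr)
```

Here `libr` plays the role of the glycoletter:index dictionary; building it on the fly is a simplification for this example.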

dataset_to_dataloader

 dataset_to_dataloader (glycan_list:List[str],
                        labels:List[Union[float,int]],
                        libr:Optional[Dict[str,int]]=None,
                        batch_size:int=32, shuffle:bool=True,
                        drop_last:bool=False,
                        extra_feature:Optional[List[float]]=None,
                        label_type:torch.dtype=torch.int64,
                        augment_prob:float=0.0,
                        generalization_prob:float=0.2)

wrapper function to convert glycans and labels to a torch_geometric DataLoader

	Type	Default	Details
glycan_list	List		list of IUPAC-condensed glycans
labels	List		list of labels
libr	Optional	None	dictionary of glycoletter:index
batch_size	int	32	samples per batch
shuffle	bool	True	shuffle samples in dataloader
drop_last	bool	False	drop last batch
extra_feature	Optional	None	additional input features
label_type	dtype	torch.int64	tensor type for label
augment_prob	float	0.0	probability of data augmentation
generalization_prob	float	0.2	probability of wildcarding
Returns	DataLoader		dataloader for training

next(iter(dataset_to_dataloader(["Neu5Ac(a2-3)Gal(b1-4)Glc",
                                 "Fuc(a1-2)Gal(b1-3)GalNAc"], [1, 0])))

DataBatch(edge_index=[2, 16], labels=[10], string_labels=[2], num_nodes=10, y=[2], batch=[10], ptr=[3])
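
How batch_size, shuffle, and drop_last interact can be shown with a plain-Python batching sketch. The generator `iter_batches` and its `seed` argument are hypothetical and only mimic the general behavior of a dataloader; they are not part of the library or of torch_geometric.

```python
import random


def iter_batches(items, batch_size=32, shuffle=True, drop_last=False, seed=None):
    """Yield lists of at most batch_size items (illustrative only)."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # With drop_last, an incomplete final batch is discarded.
        if drop_last and len(idx) < batch_size:
            break
        yield [items[i] for i in idx]


samples = list(range(10))  # stand-ins for ten glycan graphs
full = list(iter_batches(samples, batch_size=4, shuffle=False))
trimmed = list(iter_batches(samples, batch_size=4, shuffle=False, drop_last=True))
```

With ten samples and a batch size of four, the first call yields three batches (the last with two items) and the second yields only the two complete ones.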

split_data_to_train

 split_data_to_train (glycan_list_train:List[str],
                      glycan_list_val:List[str],
                      labels_train:List[Union[float,int]],
                      labels_val:List[Union[float,int]],
                      libr:Optional[Dict[str,int]]=None,
                      batch_size:int=32, drop_last:bool=False,
                      extra_feature_train:Optional[List[float]]=None,
                      extra_feature_val:Optional[List[float]]=None,
                      label_type:torch.dtype=torch.int64,
                      augment_prob:float=0.0,
                      generalization_prob:float=0.2)

wrapper function to convert split training/validation data into a dictionary of dataloaders

	Type	Default	Details
glycan_list_train	List		training glycans
glycan_list_val	List		validation glycans
labels_train	List		training labels
labels_val	List		validation labels
libr	Optional	None	dictionary of glycoletter:index
batch_size	int	32	samples per batch
drop_last	bool	False	drop last batch
extra_feature_train	Optional	None	additional training features
extra_feature_val	Optional	None	additional validation features
label_type	dtype	torch.int64	tensor type for label
augment_prob	float	0.0	probability of data augmentation
generalization_prob	float	0.2	probability of wildcarding
Returns	Dict		dictionary of train/val dataloaders

split_data_to_train(["Neu5Ac(a2-3)Gal(b1-4)Glc", "Fuc(a1-2)Gal(b1-3)GalNAc"],
                    ["Neu5Ac(a2-6)Gal(b1-4)Glc", "Fuc(a1-2)Gal(a1-3)GalNAc"],
                    [1, 0], [0,1])

{'train': <torch_geometric.loader.dataloader.DataLoader>,
 'val': <torch_geometric.loader.dataloader.DataLoader>}

inference

can be used to analyze trained models, make predictions, or obtain glycan representations

glycans_to_emb

 glycans_to_emb (glycans:List[str], model:torch.nn.modules.module.Module,
                 libr:Optional[Dict[str,int]]=None, batch_size:int=32,
                 rep:bool=True, class_list:Optional[List[str]]=None)

Returns a dataframe of learned representations for a list of glycans

	Type	Default	Details
glycans	List		list of glycans in IUPAC-condensed
model	Module		trained graph neural network for analyzing glycans
libr	Optional	None	dictionary of form glycoletter:index
batch_size	int	32	batch size used during training
rep	bool	True	True returns representations, False returns predicted labels
class_list	Optional	None	list of unique classes to map predictions
Returns	Union		dataframe of representations or list of predictions

get_lectin_preds

 get_lectin_preds (prot:str, glycans:List[str],
                   model:torch.nn.modules.module.Module,
                   prot_dic:Optional[Dict[str,List[float]]]=None,
                   background_correction:bool=False,
                   correction_df:Optional[pandas.core.frame.DataFrame]=None,
                   batch_size:int=128, libr:Optional[Dict[str,int]]=None,
                   sort:bool=True, flex:bool=False)

Wrapper that uses LectinOracle-type model for predicting binding of protein to glycans

	Type	Default	Details
prot	str		protein amino acid sequence
glycans	List		list of glycans in IUPAC-condensed
model	Module		trained LectinOracle-type model
prot_dic	Optional	None	dict of protein sequence:ESM1b representation
background_correction	bool	False	whether to correct predictions for background
correction_df	Optional	None	background prediction for glycans
batch_size	int	128	batch size used during training
libr	Optional	None	dict of glycoletter:index
sort	bool	True	whether to sort prediction results descendingly
flex	bool	False	LectinOracle (False) or LectinOracle_flex (True)
Returns	DataFrame		glycan sequences and predicted binding

get_Nsequon_preds

 get_Nsequon_preds (prots:List[str], model:torch.nn.modules.module.Module,
                    prot_dic:Dict[str,List[float]])

Predicts whether an N-sequon will be glycosylated

	Type	Details
prots	List	20 AA + N + 20 AA sequences; replace missing with ‘z’
model	Module	trained NSequonPred-type model
prot_dic	Dict	dict of protein sequence:ESM1b representation
Returns	DataFrame	protein sequences and predicted likelihood
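
The expected input format (20 amino acids, the central N, 20 more amino acids, with ‘z’ for missing positions) can be prepared with a short helper. The sketch below is hypothetical: the function `sequon_windows` is not part of the library, it assumes the common N-X-S/T sequon definition (X not proline), and it reads "missing" as positions beyond the ends of the protein.

```python
def sequon_windows(protein, flank=20, pad="z"):
    """Collect flank + N + flank windows around each N-X-S/T sequon (X != P)."""
    # Pad both ends so that windows near the termini keep their full length.
    padded = pad * flank + protein + pad * flank
    windows = []
    for i in range(len(protein) - 2):
        if protein[i] == "N" and protein[i + 1] != "P" and protein[i + 2] in "ST":
            center = i + flank  # position of this N in the padded sequence
            windows.append(padded[center - flank:center + flank + 1])
    return windows


# Made-up toy sequence: N-A-T is a sequon, N-P-S is not.
windows = sequon_windows("MKNATLLPNPSGG")
```

Each returned window is 41 characters long with the asparagine at index 20.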

get_esm1b_representations

 get_esm1b_representations (prots:List[str],
                            model:torch.nn.modules.module.Module,
                            alphabet:Any)

Retrieves ESM1b representations of proteins for use as input to LectinOracle

	Type	Details
prots	List	list of protein sequences to convert
model	Module	trained ESM1b model
alphabet	Any	used for converting sequences
Returns	Dict	dict of protein sequence:ESM1b representation

In order to run get_esm1b_representations, you first have to run this snippet:

!pip install fair-esm
import esm
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()

train_test_split

contains various data split functions to get appropriate training and test sets

hierarchy_filter

 hierarchy_filter (df_in:pandas.core.frame.DataFrame, rank:str='Domain',
                   min_seq:int=5, wildcard_seed:bool=False,
                   wildcard_list:Optional[List[str]]=None,
                   wildcard_name:Optional[str]=None, r:float=0.1,
                   col:str='glycan')

stratified train/test split at a given taxonomic level, removing duplicate glycans and infrequent classes

	Type	Default	Details
df_in	DataFrame		dataframe of glycan sequences and taxonomic labels
rank	str	Domain	taxonomic rank to filter
min_seq	int	5	minimum glycans per class
wildcard_seed	bool	False	seed wildcard glycoletters
wildcard_list	Optional	None	glycoletters for wildcard
wildcard_name	Optional	None	wildcard name in IUPAC
r	float	0.1	replacement rate
col	str	glycan	column name for glycans
Returns	Tuple		train/val splits and mappings

train_x, val_x, train_y, val_y, id_val, class_list, class_converter = hierarchy_filter(df_species,
                                                                                       rank = 'Kingdom')
print(train_x[:10])

['Neu5Ac(a2-8)Neu5Ac(a2-6)[Gal(?1-?)GalNAc(a1-3)]GalNAc', 'Man(a1-2)Man(a1-2)Man(b1-3)GlcNAc(a1-6)Man', 'GlcNAc(b1-2)Man(a1-6)[Man(a1-3)][GlcNAc(b1-4)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'GalNAcOS(b1-4)GlcNAc(b1-2)Man(a1-3)[GalNAcOS(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-4)[Gal(b1-4)GlcNAc(b1-2)]Man(a1-3)[Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Neu5Ac(a2-?)Gal(b1-4)[Fuc(a1-3)]GlcNAc(b1-2)Man(a1-?)[Fuc(a1-3)[Gal(b1-4)]GlcNAc(b1-2)Man(a1-?)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', '{Gal(b1-4)GlcNAc(b1-?)}{Gal(b1-4)GlcNAc(b1-?)}{Gal(b1-4)GlcNAc(b1-?)}{Neu5Ac(a2-?)}{Neu5Ac(a2-?)}Gal(b1-4)GlcNAc(b1-2)[Gal(b1-4)GlcNAc(b1-4)]Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)[Gal(b1-4)GlcNAc(b1-6)]Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'Glc(a1-4)Glc(a1-4)[Glc(a1-6)Glc(a1-6)]Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc(a1-4)Glc', 'Fuc(a1-?)Hex(?1-?)[Hex(?1-?)]GalNAc', 'Neu5Ac(a2-?)GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc']
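
The filtering part of this function (dropping duplicates and classes with fewer than min_seq glycans, then mapping class names to indices) can be sketched in plain Python. The function `filter_by_rank` is hypothetical, it omits the stratified split, and the meaning assumed here for class_list and class_converter (sorted class names and a name-to-index dictionary) is a guess from their names in the example above.

```python
from collections import Counter


def filter_by_rank(rows, min_seq=5):
    """rows: (glycan, class_label) pairs for one taxonomic rank."""
    # Drop duplicate (glycan, label) pairs while keeping the original order.
    unique = list(dict.fromkeys(rows))
    counts = Counter(label for _, label in unique)
    # Keep only classes that have at least min_seq glycans.
    kept = [(g, c) for g, c in unique if counts[c] >= min_seq]
    class_list = sorted({c for _, c in kept})
    class_converter = {c: i for i, c in enumerate(class_list)}
    glycans = [g for g, _ in kept]
    labels = [class_converter[c] for _, c in kept]
    return glycans, labels, class_list, class_converter


# Made-up toy rows; not taken from df_species.
rows = [("Gal(b1-4)Glc", "Animalia"), ("Gal(b1-4)Glc", "Animalia"),
        ("Man(a1-2)Man", "Animalia"), ("Glc(a1-4)Glc", "Plantae")]
glycans, labels, class_list, class_converter = filter_by_rank(rows, min_seq=2)
```

In this toy input the duplicated row is collapsed and the single Plantae glycan is removed because its class falls below min_seq.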

general_split

 general_split (glycans:List[str], labels:List[Union[float,int,str]],
                test_size:float=0.2)

splits glycans and labels into train / test sets

	Type	Default	Details
glycans	List		list of IUPAC-condensed glycans
labels	List		list of prediction labels
test_size	float	0.2	size of test set
Returns	Tuple		train/test splits

train_x, val_x, train_y, val_y = general_split(df_species.glycan.values.tolist(),
                                              df_species.Species.values.tolist())
print(train_x[:10])

['GlcNAc(b1-2)[GlcNAc(b1-4)]Man(a1-3)[GlcNAc(b1-2)Man(a1-6)][GlcNAc(b1-4)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-?)GlcNAc6S(b1-6)[Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-3)]GalNAc', 'Neu5Ac(a2-?)GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)[Neu5Ac(a2-?)]GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc', 'Glc(a1-4)Glc(a1-4)[Glc(a1-4)Glc(a1-4)Glc(a1-6)]Glc(a1-4)Glc', 'Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc', 'Fuc(a1-2)Gal(b1-4)Glc-ol', 'Neu5Gc(a2-8)Neu5Ac(a2-6)[Gal(b1-3)]GalNAc', 'Glc(b1-4)Rha(b1-3)[Glc(a1-6)]Gal', 'Glc6Ac(b1-2)Glc1Ole6Ac(b1-4)Glc6Ac', 'Neu5Ac(a2-3)[GalNAc(b1-4)]Gal(b1-4)Glc-ol']
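
A random split of this kind can be sketched with the standard library. The function `simple_split` and its `seed` argument are hypothetical and only illustrate the role of test_size and the order of the returned values; the library's own function may shuffle differently.

```python
import random


def simple_split(glycans, labels, test_size=0.2, seed=42):
    """Randomly split paired glycans and labels into train and test parts."""
    idx = list(range(len(glycans)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(idx) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(seq, ids):
        return [seq[i] for i in ids]

    # Same order as in the example above: train_x, val_x, train_y, val_y.
    return (pick(glycans, train_idx), pick(glycans, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


toy_glycans = [f"glycan_{i}" for i in range(10)]   # placeholder names
toy_labels = [i % 2 for i in range(10)]            # placeholder labels
train_x, val_x, train_y, val_y = simple_split(toy_glycans, toy_labels)
```

Because glycans and labels are indexed with the same shuffled positions, each glycan keeps its own label after the split.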

prepare_multilabel

 prepare_multilabel (df:pandas.core.frame.DataFrame, rank:str='Species',
                     glycan_col:str='glycan')

converts a file with one glycan-species/tissue/disease association per row into a format with one entry per glycan that holds all of its associations

	Type	Default	Details
df	DataFrame		dataframe with one glycan-association per row
rank	str	Species	label column to use
glycan_col	str	glycan	column with glycan sequences
Returns	Tuple		unique glycans and their label vectors

glycans, labels = prepare_multilabel(df_species[df_species.Order == 'Carnivora'])
print(glycans[50])
print(labels[50])

GlcNAcOS(b1-6)Gal(b1-3)GalNAc
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
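
The label vector printed above has one position per class, with 1.0 marking each class the glycan is associated with. A minimal sketch of this conversion follows; the function `to_multilabel` is hypothetical, and sorting the classes alphabetically to fix the column order is an assumption made for this example.

```python
def to_multilabel(rows):
    """rows: (glycan, label) pairs, one association per row."""
    classes = sorted({label for _, label in rows})
    col = {label: i for i, label in enumerate(classes)}
    vectors = {}
    for glycan, label in rows:
        # Create an all-zero vector the first time a glycan is seen.
        vec = vectors.setdefault(glycan, [0.0] * len(classes))
        vec[col[label]] = 1.0
    return list(vectors), list(vectors.values()), classes


# Made-up toy associations; not taken from df_species.
rows = [("Gal(b1-4)Glc", "Felis_catus"),
        ("Gal(b1-4)Glc", "Canis_lupus"),
        ("Fuc(a1-2)Gal", "Canis_lupus")]
toy_glycans, toy_vectors, toy_classes = to_multilabel(rows)
```

The first toy glycan ends up with both positions set, the second with only the position of its single class.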