Galaxy-ML

Galaxy-ML is a web machine learning end-to-end pipeline building framework, with special support to biomedical data. Under the management of unified scikit-learn APIs, cutting-edge machine learning libraries are combined together to provide thousands of different pipelines suitable for various needs. In the form of Galalxy tools, Galaxy-ML provides scalabe, reproducible and transparent machine learning computations.

Key features

simple web UI
no coding or minimum coding requirement
fast model deployment and model selection, specialized in hyperparameter tuning using GridSearchCV
high level of parallel and automated computation

Supported modules

A typic machine learning pipeline is composed of a main estimator/model and optional preprocessing component(s).

Model

scikit-learn
- sklearn.ensemble
- sklearn.linear_model
- sklearn.naive_bayes
- sklearn.neighbors
- sklearn.svm
- sklearn.tree
xgboost
- XGBClassifier
- XGBRegressor
mlxtend
- StackingCVClassifier
- StackingClassifier
- StackingCVRegressor
- StackingRegressor
Keras (Deep learning models are re-implemented to fully support sklearn APIs. Supports parameter, including layer subparameter, swaps or searches. Supports callbacks)
- KerasGClassifier
- KerasGRegressor
- KerasGBatchClassifier (works best with online data generators, processing images, genomic sequences and so on)
BinarizeTargetClassifier/BinarizeTargetRegressor
IRAPSClassifier

Preprocessor

scikit-learn
- sklearn.preprocessing
- sklearn.feature_selection
- sklearn.decomposition
- sklearn.kernel_approximation
- sklearn.cluster
imblanced-learn
- imblearn.under_sampling
- imblearn.over_sampling
- imblearn.combine
skrebate
- ReliefF
- SURF
- SURFstar
- MultiSURF
- MultiSURFstar
TDMScaler
DyRFE/DyRFECV
Z_RandomOverSampler
GenomeOneHotEncoder
ProteinOneHotEncoder
FastaDNABatchGenerator
FastaRNABatchGenerator
FastaProteinBatchGenerator
GenomicIntervalBatchGenerator
GenomicVariantBatchGenerator
ImageDataFrameBatchGenerator

Installation

APIs for models, preprocessors and utils implemented in Galaxy-ML can be installed separately.

Installing using anaconda (recommended)

conda install -c bioconda -c conda-forge Galaxy-ML

Installing using pip

pip install -U Galaxy-ML

Installing from source

python setup.py install

Using source code inplace

python setup.py build_ext --inplace

To install Galaxy-ML tools in Galaxy, please refer to https://galaxyproject.org/admin/tools/add-tool-from-toolshed-tutorial/.

Examples for using Galaxy-ML custom models

# handle imports
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.model_selection import GridSearchCV
from galaxy_ml.keras_galaxy_models import KerasGClassifier


# build a DNN classifier
model = Sequential()
model.add(Dense(64))
model.add(Activation(‘relu'))
model.add((Dense(1, activation=‘sigmoid’)))
config = model.get_config()

classifier = KerasGClassifier(config, random_state=42)


# clone a classifier
clf = clone(classifier)


# Get parameters
params = clf.get_params()


# Set parameters
new_params = dict(
    epochs=60,
    lr=0.01,
    layers_1_Dense__config__kernel_initializer__config__seed=999,
    layers_0_Dense__config__kernel_initializer__config__seed=999
)
clf.set_params(**new_params)


# model evaluation using GridSearchCV
grid = GridSearchCV(clf, param_grid={}, scoring=‘roc_auc’, cv=5, n_jobs=2)
grid.fit(X, y)

Example for using Galaxy-ML to persist a sklearn/keras model

from galaxy_ml.model_persist import (dump_model_to_h5,
                                     load_model_from_h5)

# dump model to hdf5
dump_model_to_h5(model, `save_path`,
                 store_hyperparameter=True)

# load model from hdf5
model = load_model_from_h5(`path_to_hdf5`)

Performance comparison

Galaxy-ML's HDF5 saving utils perform faster than cPickle for large, array-rich models.

Loading model using pickle...
(1.2471628189086914 s)

Dumping model using pickle...
(3.6942389011383057 s)
File size: 930712861

Dumping model to hdf5...
(3.006715774536133 s)
File size: 930729696

Loading model from hdf5...
(0.6420958042144775 s)

Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)