Galaxy-ML
Galaxy-ML is a web-based framework for building end-to-end machine learning pipelines, with special support for biomedical data. Cutting-edge machine learning libraries are combined under unified scikit-learn APIs to provide thousands of different pipelines suitable for various needs. Delivered as Galaxy tools, Galaxy-ML provides scalable, reproducible and transparent machine learning computations.
Key features
- simple web UI
- little or no coding required
- fast model deployment and model selection, specialized in hyperparameter tuning using GridSearchCV
- high level of parallel and automated computation
Supported modules
A typical machine learning pipeline is composed of a main estimator/model and optional preprocessing component(s).
Model
- scikit-learn
  - sklearn.ensemble
  - sklearn.linear_model
  - sklearn.naive_bayes
  - sklearn.neighbors
  - sklearn.svm
  - sklearn.tree
- xgboost
  - XGBClassifier
  - XGBRegressor
- mlxtend
  - StackingCVClassifier
  - StackingClassifier
  - StackingCVRegressor
  - StackingRegressor
- Keras (Deep learning models are re-implemented to fully support sklearn APIs. Supports parameter swaps or searches, including layer subparameters. Supports callbacks.)
  - KerasGClassifier
  - KerasGRegressor
  - KerasGBatchClassifier (works best with online data generators, for example when processing images or genomic sequences)
- BinarizeTargetClassifier/BinarizeTargetRegressor
- IRAPSClassifier
Preprocessor
- scikit-learn
  - sklearn.preprocessing
  - sklearn.feature_selection
  - sklearn.decomposition
  - sklearn.kernel_approximation
  - sklearn.cluster
- imbalanced-learn
  - imblearn.under_sampling
  - imblearn.over_sampling
  - imblearn.combine
- skrebate
  - ReliefF
  - SURF
  - SURFstar
  - MultiSURF
  - MultiSURFstar
- TDMScaler
- DyRFE/DyRFECV
- Z_RandomOverSampler
- GenomeOneHotEncoder
- ProteinOneHotEncoder
- FastaDNABatchGenerator
- FastaRNABatchGenerator
- FastaProteinBatchGenerator
- GenomicIntervalBatchGenerator
- GenomicVariantBatchGenerator
- ImageDataFrameBatchGenerator
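The pipeline structure described in this section (optional preprocessors followed by a main estimator) can be sketched with plain scikit-learn. This is a minimal illustration, not Galaxy-ML's own code: the synthetic dataset, the choice of RobustScaler and KNeighborsClassifier, and the parameter values are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data stands in for a real dataset (assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing component first, main estimator last.
pipeline = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=15))
pipeline.fit(X_train, y_train)

# Step names are derived from the lowercased class names.
step_names = [name for name, _ in pipeline.steps]
accuracy = pipeline.score(X_test, y_test)
print(step_names, accuracy)
```

Any preprocessor and model from the lists above that follows the scikit-learn estimator API should slot into the same two positions.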
Installation
APIs for models, preprocessors and utils implemented in Galaxy-ML can be installed separately.
Installing using anaconda (recommended)
conda install -c bioconda -c conda-forge Galaxy-ML
Installing using pip
pip install -U Galaxy-ML
Installing from source
python setup.py install
Using source code in place
python setup.py build_ext --inplace
To install Galaxy-ML tools in Galaxy, please refer to https://galaxyproject.org/admin/tools/add-tool-from-toolshed-tutorial/.
Examples for using Galaxy-ML custom models
# handle imports
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV
from galaxy_ml.keras_galaxy_models import KerasGClassifier
# build a DNN classifier
model = Sequential()
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
config = model.get_config()
classifier = KerasGClassifier(config, random_state=42)
# clone a classifier
clf = clone(classifier)
# get parameters
params = clf.get_params()
# set parameters, including layer subparameters
new_params = dict(
    epochs=60,
    lr=0.01,
    layers_1_Dense__config__kernel_initializer__config__seed=999,
    layers_0_Dense__config__kernel_initializer__config__seed=999
)
clf.set_params(**new_params)
# model evaluation using GridSearchCV
# (X and y are your feature matrix and binary target, defined elsewhere)
grid = GridSearchCV(clf, param_grid={}, scoring='roc_auc', cv=5, n_jobs=2)
grid.fit(X, y)
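The example above passes an empty param_grid, so GridSearchCV only cross-validates the current settings. A sketch of an actual search follows. To keep it runnable without Keras, it uses a plain scikit-learn pipeline as a stand-in for KerasGClassifier; the dataset, the step names ("scaler", "clf") and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Nested parameters are addressed as <step name>__<parameter name>.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=5, n_jobs=1)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```

The same double-underscore convention is what makes names such as layers_0_Dense__config__kernel_initializer__config__seed searchable in the Keras example.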
Example for using Galaxy-ML to persist a sklearn/keras model
from galaxy_ml.model_persist import (dump_model_to_h5,
                                     load_model_from_h5)
# dump model to hdf5 (save_path is the destination file path)
dump_model_to_h5(model, save_path,
                 store_hyperparameter=True)
# load model from hdf5 (path_to_hdf5 is the path of a previously saved file)
model = load_model_from_h5(path_to_hdf5)
Performance comparison
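Galaxy-ML's HDF5 saving utils are faster than cPickle for large, array-rich models. In the run below, dumping took about 3.0 s with HDF5 versus 3.7 s with pickle, and loading took about 0.64 s versus 1.25 s, with nearly identical file sizes.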
Loading model using pickle...
(1.2471628189086914 s)
Dumping model using pickle...
(3.6942389011383057 s)
File size: 930712861
Dumping model to hdf5...
(3.006715774536133 s)
File size: 930729696
Loading model from hdf5...
(0.6420958042144775 s)
Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)
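The pickle half of a comparison like the one above can be approximated with a small standard-library harness. This is only a sketch: the helper name time_pickle_roundtrip and the fake_model object are hypothetical, and the HDF5 side is not reproduced because it depends on Galaxy-ML itself. Absolute timings will vary by machine and model.

```python
import os
import pickle
import tempfile
import time


def time_pickle_roundtrip(obj):
    """Return (dump_seconds, load_seconds, file_size_bytes, loaded_obj)."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)
    try:
        start = time.perf_counter()
        with open(path, "wb") as fh:
            pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
        dump_seconds = time.perf_counter() - start

        size = os.path.getsize(path)

        start = time.perf_counter()
        with open(path, "rb") as fh:
            loaded = pickle.load(fh)
        load_seconds = time.perf_counter() - start
    finally:
        # Always clean up the temporary file.
        os.remove(path)
    return dump_seconds, load_seconds, size, loaded


# Hypothetical stand-in for a fitted model that holds large arrays.
fake_model = {"weights": [float(i) for i in range(200_000)], "n_neighbors": 100}
dump_s, load_s, size, restored = time_pickle_roundtrip(fake_model)
print(f"Dumping: {dump_s:.4f} s, loading: {load_s:.4f} s, file size: {size}")
```

Timing the dump and load steps separately, as the log above does, matters because a format can be faster at one and slower at the other.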