Z_RandomOverSampler
galaxy_ml.preprocessors._z_random_over_sampler.Z_RandomOverSampler(sampling_strategy='auto', return_indices=False, random_state=None, ratio=None, negative_thres=0, positive_thres=-1)
TDMScaler
galaxy_ml.preprocessors._tdm_scaler.TDMScaler(q_lower=25.0, q_upper=75.0)
Scale features using the Training Distribution Matching (TDM) algorithm.
References
.. [1] Thompson JA, Tan J and Greene CS (2016) Cross-platform
normalization of microarray and RNA-seq data for machine
learning applications. PeerJ 4, e1621.
GenomeOneHotEncoder
galaxy_ml.preprocessors._genome_one_hot_encoder.GenomeOneHotEncoder(fasta_path=None, padding=True, seq_length=None)
Convert genomic sequences to a one-hot encoded 2D array.
Parameters
- fasta_path: str, default None
File path to the fasta file. There are two other ways to set fasta_path: 1) through fit_params; 2) through set_params(). If fasta_path is None, the sequences are assumed to be in the first column of X.
- padding: bool, default is True
All sequences are expected to have the same length, but sometimes they do not. If True, all sequences are padded or truncated to the length of the first entry. If False, a ValueError is raised when different sequence lengths are found.
- seq_length: None or int
Sequence length. If None, it is determined by the first entry.
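The padding and seq_length rules above can be illustrated with a minimal pure-Python sketch. It is not the galaxy_ml implementation: the helper name one_hot_dna, the A/C/G/T column order, and the use of all-zero rows for padded or unknown positions are assumptions made for illustration.

```python
# Illustrative only: not the galaxy_ml implementation.
BASES = "ACGT"  # assumed column order


def one_hot_dna(sequences, padding=True, seq_length=None):
    """Encode each sequence as a seq_length x 4 nested list of 0/1 values."""
    if seq_length is None:
        seq_length = len(sequences[0])  # determined by the first entry
    encoded = []
    for seq in sequences:
        if len(seq) != seq_length:
            if not padding:
                raise ValueError("Sequences have different lengths.")
            # Truncate long sequences; pad short ones with a placeholder
            # that matches no base, so it encodes as an all-zero row.
            seq = seq[:seq_length].ljust(seq_length, "N")
        encoded.append(
            [[int(base == b) for b in BASES] for base in seq.upper()]
        )
    return encoded


arrays = one_hot_dna(["ACGT", "AC", "ACGTTT"])
print(arrays[1])  # the short sequence ends with two all-zero rows
```

With padding=False the same input raises ValueError, because the second and third sequences differ in length from the first.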
ProteinOneHotEncoder
galaxy_ml.preprocessors._genome_one_hot_encoder.ProteinOneHotEncoder(fasta_path=None, padding=True, seq_length=None)
Convert protein sequences to a one-hot encoded 2D array.
Parameters
- fasta_path: str, default None
File path to the fasta file. There are two other ways to set fasta_path: 1) through fit_params; 2) through set_params(). If fasta_path is None, the sequences are assumed to be in the first column of X.
- padding: bool, default is True
All sequences are expected to have the same length, but sometimes they do not. If True, all sequences are padded or truncated to the length of the first entry. If False, a ValueError is raised when different sequence lengths are found.
- seq_length: None or int
Sequence length. If None, it is determined by the first entry.
FastaIterator
galaxy_ml.preprocessors._fasta_iterator.FastaIterator(n, batch_size=32, shuffle=True, seed=0)
Base class for fasta sequence iterators.
Parameters
- n: int
Total number of samples
- batch_size: int
Size of batch
- shuffle: bool
Whether to shuffle data between epochs
- seed: int
Random seed number for data shuffling
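How n, batch_size, shuffle and seed interact can be sketched with the standard library alone. The function batch_indices below is a hypothetical helper, not part of galaxy_ml, and it covers a single epoch only; how the real iterator reshuffles between epochs is not shown here.

```python
import random


def batch_indices(n, batch_size=32, shuffle=True, seed=0):
    """Yield lists of sample indices that cover range(n) exactly once."""
    order = list(range(n))
    if shuffle:
        # A seeded generator makes the shuffled order reproducible.
        random.Random(seed).shuffle(order)
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]


batches = list(batch_indices(n=10, batch_size=4, shuffle=True, seed=0))
print([len(b) for b in batches])  # [4, 4, 2]
```

The last batch is smaller whenever n is not a multiple of batch_size.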
FastaToArrayIterator
galaxy_ml.preprocessors._fasta_iterator.FastaToArrayIterator(X, generator, y=None, batch_size=32, shuffle=True, sample_weight=None, seed=None)
Iterator yielding Numpy array from fasta sequences
Parameters
- X: array
Contains sequence indexes in the fasta file
- generator: fitted object
Instance of BatchGenerator, e.g., FastaDNABatchGenerator or FastaProteinBatchGenerator
- y: array
Target labels or values
- batch_size: int, default=32
- shuffle: bool, default=True
Whether to shuffle the data between epochs
- sample_weight: None or array
Sample weight
- seed: int
Random seed for data shuffling
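Because X holds indexes into the fasta file rather than the sequences themselves, each batch is built by looking records up by index. The sketch below shows that lookup with a tiny in-memory FASTA. The helpers read_fasta and make_batch are hypothetical, and the sketch stops at raw strings, whereas the iterator described above yields NumPy arrays produced by the fitted generator.

```python
import io

FASTA_TEXT = """>seq0
ACGT
>seq1
GGCC
AATT
>seq2
TTTT
"""


def read_fasta(handle):
    """Return the sequences of a FASTA file as a list of strings."""
    records, current = [], None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            current = []  # start a new record
            records.append(current)
        elif line and current is not None:
            current.append(line)  # sequences may span several lines
    return ["".join(parts) for parts in records]


def make_batch(records, X, positions, y=None):
    """Fetch the sequences for the given positions of X (and targets)."""
    seqs = [records[X[p]] for p in positions]
    if y is None:
        return seqs
    return seqs, [y[p] for p in positions]


records = read_fasta(io.StringIO(FASTA_TEXT))
X = [2, 0]  # indexes of the records to use, in sample order
y = [1, 0]  # targets aligned with X
seqs, targets = make_batch(records, X, positions=[0, 1], y=y)
print(seqs, targets)
```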
FastaDNABatchGenerator
galaxy_ml.preprocessors._fasta_dna_batch_generator.FastaDNABatchGenerator(fasta_path, seq_length=1000, shuffle=True, seed=None)
Fasta sequence batch data generator, online transformation of sequences to array.
Parameters
- fasta_path: str
File path to fasta file.
- seq_length: int, default=1000
Sequence length, number of bases.
- shuffle: bool, default=True
Whether to shuffle the data between epochs
- seed: int
Random seed for data shuffling
FastaRNABatchGenerator
galaxy_ml.preprocessors._fasta_rna_batch_generator.FastaRNABatchGenerator(fasta_path, seq_length=1000, shuffle=True, seed=None)
Fasta sequence batch data generator, online transformation of sequences to array.
Parameters
- fasta_path: str
File path to fasta file.
- seq_length: int, default=1000
Sequence length, number of bases.
- shuffle: bool, default=True
Whether to shuffle the data between epochs
- seed: int
Random seed for data shuffling
FastaProteinBatchGenerator
galaxy_ml.preprocessors._fasta_protein_batch_generator.FastaProteinBatchGenerator(fasta_path, seq_length=1000, shuffle=True, seed=None)
Fasta sequence batch data generator, online transformation of sequences to array.
Parameters
- fasta_path: str
File path to fasta file.
- seq_length: int, default=1000
Sequence length, number of bases.
- shuffle: bool, default=True
Whether to shuffle the data between epochs
- seed: int
Random seed for data shuffling
IntervalsToArrayIterator
galaxy_ml.preprocessors._genomic_interval_batch_generator.IntervalsToArrayIterator(X, generator, y=None, batch_size=32, shuffle=True, sample_weight=None, seed=None, sample_probabilities=None)
Iterator yielding Numpy array from intervals and reference sequences.
Parameters
- X: array
Contains sequence indexes in the fasta file
- generator: fitted object
Instance of GenomicIntervalBatchGenerator.
- y: None
y exists only because of inheritance and should always be None.
- batch_size: int, default=32
- shuffle: bool, default=True
Whether to shuffle the data between epochs
- sample_weight: None or array
Sample weight
- seed: int
Random seed for data shuffling
- sample_probabilities: 1-D array or None, default is None.
The probabilities to draw samples. Unlike the sample weight, this parameter only changes the frequency of sampling; it does not change the loss during training.
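The difference between sample_probabilities and sample_weight can be made concrete with a small sketch. Both helpers below are hypothetical and use only the standard library. Drawing with replacement and normalizing the weighted loss by the sum of weights are assumptions, not statements about how galaxy_ml or Keras implement these steps.

```python
import random


def draw_batch(n, batch_size, sample_probabilities=None, seed=None):
    """Pick sample indices; probabilities change only how often each is drawn."""
    rng = random.Random(seed)
    # weights=None gives uniform sampling; drawing is with replacement.
    return rng.choices(range(n), weights=sample_probabilities, k=batch_size)


def weighted_mean_loss(losses, sample_weight):
    """Sample weights rescale each sample's contribution to the loss."""
    total = sum(w * loss for w, loss in zip(sample_weight, losses))
    return total / sum(sample_weight)


# Index 1 is the only sample with nonzero probability, so it is always drawn.
print(draw_batch(n=3, batch_size=5, sample_probabilities=[0, 1, 0], seed=1))

# The second sample counts three times as much in the loss.
print(weighted_mean_loss([1.0, 3.0], sample_weight=[1.0, 3.0]))  # 2.5
```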
GenomicIntervalBatchGenerator
galaxy_ml.preprocessors._genomic_interval_batch_generator.GenomicIntervalBatchGenerator(ref_genome_path=None, intervals_path=None, target_path=None, features='infer', blacklist_regions='hg38', shuffle=True, seed=None, seq_length=1000, center_bin_to_predict=200, feature_thresholds=0.5, random_state=None)
Generate sequence arrays and target values from a reference genome, intervals, and a genomic feature dataset. Tries to mimic selene_sdk.samplers.interval_sampler.IntervalsSampler.
Parameters
- ref_genome_path: str
File path to the reference genome, usually in fasta format.
- intervals_path: str
File path to the intervals dataset.
- target_path: str
File path to the dataset containing genomic features or target information, usually in bed or bed.gz format.
- features: list of str or 'infer'
A list of features to predict. If 'infer', retrieve all the unique features from the target file.
- blacklist_regions: str
E.g., 'hg38'. For more info, refer to selene_sdk.sequences.Genome.
- shuffle: bool, default=True
Whether to shuffle the data between epochs.
- seed: int or None, default=None
Random seed for shuffling between epochs.
- seq_length: int, default=1000
Retrieved sequence length.
- center_bin_to_predict: int, default=200
Query the tabix-indexed file for a region of length center_bin_to_predict.
- feature_thresholds: float, default=0.5
Threshold values to determine target value.
- random_state: int or None, default=None
Random seed for sampling sequences with changing position.
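One plausible reading of seq_length, center_bin_to_predict and feature_thresholds is sketched below. The text above does not spell out the exact rule, so treating the threshold as the minimum fraction of the center bin that a feature must overlap is an assumption. The helper names and the half-open coordinate convention are also hypothetical.

```python
def windows(center, seq_length=1000, center_bin_to_predict=200):
    """Return the sequence window and the center bin around a position.

    Coordinates are half-open [start, end); this convention is an assumption.
    """
    seq_start = center - seq_length // 2
    bin_start = center - center_bin_to_predict // 2
    return (
        (seq_start, seq_start + seq_length),
        (bin_start, bin_start + center_bin_to_predict),
    )


def is_positive(bin_range, feature_range, feature_threshold=0.5):
    """Label 1 if the feature covers enough of the center bin (assumed rule)."""
    bin_start, bin_end = bin_range
    feat_start, feat_end = feature_range
    overlap = max(0, min(bin_end, feat_end) - max(bin_start, feat_start))
    return overlap / (bin_end - bin_start) >= feature_threshold


seq_range, bin_range = windows(center=1000)
print(seq_range, bin_range)  # (500, 1500) (900, 1100)
print(is_positive(bin_range, (1000, 1200)))  # True: half of the bin is covered
print(is_positive(bin_range, (1050, 1200)))  # False: only a quarter is covered
```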
GenomicVariantBatchGenerator
galaxy_ml.preprocessors._genomic_variant_batch_generator.GenomicVariantBatchGenerator(ref_genome_path=None, vcf_path=None, blacklist_regions='hg38', seq_length=1000, output_reference=False)
keras.utils.Sequence-capable sequence array generator from a reference genome and a VCF (variant call format) file.
Parameters
- ref_genome_path: str
File path to the reference genome, usually in fasta format.
- vcf_path: str
File path to the VCF dataset.
- blacklist_regions: str
E.g., 'hg38'. For more info, refer to selene_sdk.sequences.Genome.
- seq_length: int, default=1000
Retrieved sequence length.
- output_reference: bool, default is False.
If True, output reference sequence instead.
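The effect of output_reference can be sketched for the simplest case, a single-base substitution. The function variant_window is a hypothetical helper, not the galaxy_ml code. Centering the window on the variant and handling only single-base variants are assumptions made to keep the example short.

```python
def variant_window(chrom_seq, pos, ref, alt, seq_length=1000,
                   output_reference=False):
    """Return a seq_length window around a single-base variant.

    pos is 1-based, as in VCF. Indels are not handled in this sketch.
    """
    idx = pos - 1
    if chrom_seq[idx] != ref:
        raise ValueError("REF allele does not match the reference sequence.")
    start = idx - seq_length // 2
    end = start + seq_length
    if start < 0 or end > len(chrom_seq):
        raise ValueError("Window extends past the sequence ends.")
    window = chrom_seq[start:end]
    if output_reference:
        return window  # untouched reference sequence
    offset = idx - start
    return window[:offset] + alt + window[offset + 1:]


chrom = "AACCGGTTAACC"
print(variant_window(chrom, pos=6, ref="G", alt="T", seq_length=4))  # CGTT
print(variant_window(chrom, pos=6, ref="G", alt="T", seq_length=4,
                     output_reference=True))  # CGGT
```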
ImageDataFrameBatchGenerator
galaxy_ml.preprocessors._image_batch_generator.ImageDataFrameBatchGenerator(dataframe, featurewise_center=False, samplewise_center=False, featurewise_std_normalization=False, samplewise_std_normalization=False, zca_whitening=False, zca_epsilon=1e-06, rotation_range=0, width_shift_range=0.0, height_shift_range=0.0, brightness_range=None, shear_range=0.0, zoom_range=0.0, channel_shift_range=0.0, fill_mode='nearest', cval=0.0, horizontal_flip=False, vertical_flip=False, rescale=None, preprocessing_function=None, data_format='channels_last', interpolation_order=1, dtype='float32', directory=None, x_col='filename', y_col='class', weight_col=None, target_size=(256, 256), color_mode='rgb', classes=None, class_mode='categorical', shuffle=True, seed=None, save_to_dir=None, save_prefix='', save_format='png', interpolation='nearest', fit_sample_size=None)
Extend keras_preprocessing.image.ImageDataGenerator to work with DataFrame exclusively, generating batches of tensor data from images with online augmentation.
Parameters
From `keras_preprocessing.image.ImageDataGenerator`.
- featurewise_center: Boolean.
Set input mean to 0 over the dataset, feature-wise.
- samplewise_center: Boolean. Set each sample mean to 0.
- featurewise_std_normalization: Boolean.
Divide inputs by std of the dataset, feature-wise.
- samplewise_std_normalization: Boolean. Divide each input by its std.
- zca_whitening: Boolean. Apply ZCA whitening.
- zca_epsilon: epsilon for ZCA whitening. Default is 1e-6.
- rotation_range: Int. Degree range for random rotations.
- width_shift_range: Float, 1-D array-like or int.
- height_shift_range: Float, 1-D array-like or int.
- brightness_range: Tuple or list of two floats.
- shear_range: Float. Shear Intensity.
- zoom_range: Float or [lower, upper].
- channel_shift_range: Float. Range for random channel shifts.
- fill_mode: One of {"constant", "nearest", "reflect" or "wrap"}.
Default is 'nearest'. Points outside the boundaries of the input are filled according to the given mode:
  - 'constant': kkkkkkkk|abcd|kkkkkkkk (cval=k)
  - 'nearest': aaaaaaaa|abcd|dddddddd
  - 'reflect': abcddcba|abcd|dcbaabcd
  - 'wrap': abcdabcd|abcd|abcdabcd
- cval: Float or Int.
- horizontal_flip: Boolean.
Randomly flip inputs horizontally.
- vertical_flip: Boolean.
- rescale: rescaling factor. Defaults to None.
- preprocessing_function: function that will be applied on each input.
The function will run after the image is resized and augmented. The function should take one argument: one image (Numpy tensor with rank 3), and should output a Numpy tensor with the same shape.
- data_format: Image data format,
either "channels_first" or "channels_last". "channels_last" mode means that the images should have shape (samples, height, width, channels), "channels_first" mode means that the images should have shape (samples, channels, height, width). It defaults to the image_data_format value found in your Keras config file at ~/.keras/keras.json. If you never set it, then it will be "channels_last".
- interpolation_order: Int.
- dtype: Dtype to use for the generated arrays. Default is 'float32'.
- dataframe: Pandas dataframe containing the filepaths relative to directory.
From `keras_preprocessing.image.ImageDataGenerator.flow_from_dataframe`.
- directory: string, path to the directory to read images from. If None, data in the x_col column should be absolute paths.
- x_col: string, column in dataframe that contains the filenames (or absolute paths if directory is None).
- y_col: string or list, column/s in dataframe that has the target data.
- weight_col: string, column in dataframe that contains the sample weights. Default: None.
- target_size: tuple of integers (height, width), default: (256, 256).
The dimensions to which all images found will be resized.
- color_mode: one of "grayscale", "rgb", "rgba". Default: "rgb".
Whether the images will be converted to have 1, 3, or 4 color channels.
- classes: optional list of classes (e.g. ['dogs', 'cats']).
Default: None. If None, all classes in y_col will be used.
- class_mode: one of "binary", "categorical", "input", "multi_output", "raw", "sparse" or None. Default: "categorical". Mode for yielding the targets:
  - "binary": 1D numpy array of binary labels,
  - "categorical": 2D numpy array of one-hot encoded labels. Supports multi-label output.
  - "input": images identical to input images (mainly used to work with autoencoders),
  - "multi_output": list with the values of the different columns,
  - "raw": numpy array of values in y_col column(s),
  - "sparse": 1D numpy array of integer labels,
  - None: no targets are returned (the generator will only yield batches of image data, which is useful to use in model.predict_generator()).
). - shuffle: whether to shuffle the data (default: True)
- seed: optional random seed for shuffling and transformations.
- save_to_dir: Optional directory where to save the pictures being yielded, in a viewable format. This is useful for visualizing the random transformations being applied, for debugging purposes.
- save_prefix: String prefix to use for saving sample images (if save_to_dir is set).
- save_format: Format to use for saving sample images (if save_to_dir is set).
- interpolation: Interpolation method used to resample the image if the target size is different from that of the loaded image. Supported methods are "nearest", "bilinear", and "bicubic". If PIL version 1.1.3 or newer is installed, "lanczos" is also supported. If PIL version 3.4.0 or newer is installed, "box" and "hamming" are also supported. By default, "nearest" is used.
- fit_sample_size: Int. Default is None / 1000.
Number of training images used in datagen.fit. Relevant only when featurewise_center, featurewise_std_normalization, or zca_whitening is set to True.
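The fill_mode diagrams in the parameter list can be reproduced with a short one-dimensional sketch. The function fill_1d is a hypothetical helper written for illustration. The real generator fills two-dimensional images inside the underlying image library, so this only demonstrates the index rule behind each mode.

```python
def fill_1d(values, pad, fill_mode="nearest", cval="k"):
    """Extend a string by pad positions on each side using fill_mode."""
    n = len(values)
    out = []
    for i in range(-pad, n + pad):
        if 0 <= i < n:
            out.append(values[i])  # inside the original input
        elif fill_mode == "constant":
            out.append(cval)
        elif fill_mode == "nearest":
            out.append(values[min(max(i, 0), n - 1)])  # clamp to the edge
        elif fill_mode == "reflect":
            j = i % (2 * n)  # mirrored copies repeat with period 2n
            out.append(values[j if j < n else 2 * n - 1 - j])
        elif fill_mode == "wrap":
            out.append(values[i % n])  # plain repetition
        else:
            raise ValueError(f"Unknown fill_mode: {fill_mode}")
    return "".join(out)


for mode in ("constant", "nearest", "reflect", "wrap"):
    padded = fill_1d("abcd", pad=8, fill_mode=mode)
    # Print in the same left|input|right layout as the diagrams.
    print(f"{mode}: {padded[:8]}|{padded[8:12]}|{padded[12:]}")
```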