
Z_RandomOverSampler

galaxy_ml.preprocessors._z_random_over_sampler.Z_RandomOverSampler(sampling_strategy='auto', return_indices=False, random_state=None, ratio=None, negative_thres=0, positive_thres=-1)


TDMScaler

galaxy_ml.preprocessors._tdm_scaler.TDMScaler(q_lower=25.0, q_upper=75.0)

Scale features using the Training Distribution Matching (TDM) algorithm.

References

.. [1] Thompson JA, Tan J and Greene CS (2016) Cross-platform
       normalization of microarray and RNA-seq data for machine
       learning applications. PeerJ 4, e1621.


GenomeOneHotEncoder

galaxy_ml.preprocessors._genome_one_hot_encoder.GenomeOneHotEncoder(fasta_path=None, padding=True, seq_length=None)

Convert genomic sequences to a one-hot encoded 2D array.

Parameters

  • fasta_path: str, default None
    File path to the fasta file. fasta_path can also be set in two other ways: 1) through fit_params; 2) through set_params(). If fasta_path is None, the sequences are assumed to be in the first column of X.
  • padding: bool, default is True
    All sequences are expected to have the same length, but this is not always the case. If True, all sequences are padded or truncated to the length of the first entry. If False, a ValueError is raised if different sequence lengths are found.
  • seq_length: None or int
    Sequence length. If None, it is determined by the first entry.
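
The padding and seq_length behavior described above can be sketched in plain Python. This is an illustration only, not galaxy_ml's implementation: the helper name `one_hot_dna`, the A/C/G/T column order, and the choice to encode unknown bases and padded positions as all-zero rows are assumptions.

```python
# Illustrative sketch; `one_hot_dna` is a hypothetical helper, not part of galaxy_ml.
BASES = "ACGT"  # assumed column order


def one_hot_dna(sequences, padding=True, seq_length=None):
    """One-hot encode DNA strings into a list of (seq_length x 4) matrices."""
    # If seq_length is None, use the length of the first entry.
    length = len(sequences[0]) if seq_length is None else seq_length
    encoded = []
    for seq in sequences:
        if len(seq) != length and not padding:
            raise ValueError("Sequences have different lengths.")
        # Truncate long sequences; unknown bases become all-zero rows (assumption).
        rows = [[1 if base == b else 0 for b in BASES] for base in seq[:length].upper()]
        # Pad short sequences with all-zero rows (assumption).
        rows += [[0] * len(BASES) for _ in range(length - len(rows))]
        encoded.append(rows)
    return encoded


batch = one_hot_dna(["ACGT", "AC", "ACGTTT"])
```

Here every matrix in `batch` has four rows, because the first entry fixes the length: the second sequence is padded and the third is truncated.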


ProteinOneHotEncoder

galaxy_ml.preprocessors._genome_one_hot_encoder.ProteinOneHotEncoder(fasta_path=None, padding=True, seq_length=None)

Convert protein sequences to a one-hot encoded 2D array.

Parameters

  • fasta_path: str, default None
    File path to the fasta file. fasta_path can also be set in two other ways: 1) through fit_params; 2) through set_params(). If fasta_path is None, the sequences are assumed to be in the first column of X.
  • padding: bool, default is True
    All sequences are expected to have the same length, but this is not always the case. If True, all sequences are padded or truncated to the length of the first entry. If False, a ValueError is raised if different sequence lengths are found.
  • seq_length: None or int
    Sequence length. If None, it is determined by the first entry.


FastaIterator

galaxy_ml.preprocessors._fasta_iterator.FastaIterator(n, batch_size=32, shuffle=True, seed=0)

Base class for fasta sequence iterators.

Parameters

  • n: int
    Total number of samples
  • batch_size: int
    Size of batch
  • shuffle: bool
    Whether to shuffle the data between epochs
  • seed: int
    Random seed number for data shuffling
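
How n, batch_size, shuffle and seed interact can be sketched with a small index generator. This is a simplified illustration under stated assumptions, not the class's actual logic: the function name `batch_indices` and the `epochs` argument are hypothetical, and the real iterator may derive its per-epoch seed differently.

```python
import random


def batch_indices(n, batch_size=32, shuffle=True, seed=0, epochs=1):
    """Yield lists of sample indices, reshuffling at the start of each epoch."""
    rng = random.Random(seed)  # seeded generator makes the order reproducible
    order = list(range(n))
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(order)
        # The last batch of an epoch may be smaller than batch_size.
        for start in range(0, n, batch_size):
            yield order[start:start + batch_size]


batches = list(batch_indices(n=10, batch_size=4, shuffle=True, seed=0))
```

With n=10 and batch_size=4, one epoch yields three batches of sizes 4, 4 and 2 that together cover every index exactly once.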


FastaToArrayIterator

galaxy_ml.preprocessors._fasta_iterator.FastaToArrayIterator(X, generator, y=None, batch_size=32, shuffle=True, sample_weight=None, seed=None)

Iterator yielding Numpy array from fasta sequences

Parameters

  • X: array
    Contains sequence indexes in the fasta file
  • generator: fitted object
    instance of BatchGenerator, e.g., FastaDNABatchGenerator or FastaProteinBatchGenerator
  • y: array
    Target labels or values
  • batch_size: int, default=32
  • shuffle: bool, default=True
    Whether to shuffle the data between epochs
  • sample_weight: None or array
    Sample weight
  • seed: int
    Random seed for data shuffling


FastaDNABatchGenerator

galaxy_ml.preprocessors._fasta_dna_batch_generator.FastaDNABatchGenerator(fasta_path, seq_length=1000, shuffle=True, seed=None)

Fasta sequence batch data generator that transforms sequences to arrays on the fly.

Parameters

  • fasta_path: str
    File path to fasta file.
  • seq_length: int, default=1000
    Sequence length, number of bases.
  • shuffle: bool, default=True
    Whether to shuffle the data between epochs
  • seed: int
    Random seed for data shuffling
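
As a rough illustration of what on-the-fly processing of a fasta file involves, the sketch below lazily reads records and fits each sequence to seq_length. It is not the generator's implementation: `read_fasta` and `fit_length` are hypothetical helpers, and right-padding with "N" is an assumption.

```python
import io


def read_fasta(handle):
    """Lazily yield (header, sequence) pairs from an open fasta handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fit_length(seq, seq_length, pad_char="N"):
    """Truncate or right-pad a sequence to exactly seq_length characters."""
    return seq[:seq_length].ljust(seq_length, pad_char)


fasta_text = ">seq1\nACGT\nAC\n>seq2\nGG\n"
records = [(name, fit_length(seq, 5)) for name, seq in read_fasta(io.StringIO(fasta_text))]
```

Each record in `records` now has the same length and could be passed to a one-hot encoding step batch by batch.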


FastaRNABatchGenerator

galaxy_ml.preprocessors._fasta_rna_batch_generator.FastaRNABatchGenerator(fasta_path, seq_length=1000, shuffle=True, seed=None)

Fasta sequence batch data generator that transforms sequences to arrays on the fly.

Parameters

  • fasta_path: str
    File path to fasta file.
  • seq_length: int, default=1000
    Sequence length, number of bases.
  • shuffle: bool, default=True
    Whether to shuffle the data between epochs
  • seed: int
    Random seed for data shuffling


FastaProteinBatchGenerator

galaxy_ml.preprocessors._fasta_protein_batch_generator.FastaProteinBatchGenerator(fasta_path, seq_length=1000, shuffle=True, seed=None)

Fasta sequence batch data generator that transforms sequences to arrays on the fly.

Parameters

  • fasta_path: str
    File path to fasta file.
  • seq_length: int, default=1000
    Sequence length, number of amino acid residues.
  • shuffle: bool, default=True
    Whether to shuffle the data between epochs
  • seed: int
    Random seed for data shuffling


IntervalsToArrayIterator

galaxy_ml.preprocessors._genomic_interval_batch_generator.IntervalsToArrayIterator(X, generator, y=None, batch_size=32, shuffle=True, sample_weight=None, seed=None, sample_probabilities=None)

Iterator yielding NumPy arrays from intervals and reference sequences.

Parameters

  • X: array
    Contains sequence indexes in the fasta file
  • generator: fitted object
    instance of GenomicIntervalBatchGenerator.
  • y: None
    y exists only because of inheritance and should always be None.
  • batch_size: int, default=32
  • shuffle: bool, default=True
    Whether to shuffle the data between epochs
  • sample_weight: None or array
    Sample weight
  • seed: int
    Random seed for data shuffling
  • sample_probabilities: 1-D array or None, default is None.
    The probabilities used to draw samples. Unlike sample_weight, this parameter only changes the sampling frequency; it does not change the loss during training.
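
The distinction between sample_probabilities and sample_weight can be illustrated as follows. This is a sketch under assumptions, not the iterator's code: `draw_batch` and `weighted_loss` are hypothetical names, and drawing with replacement is an assumption.

```python
import random


def draw_batch(n, batch_size, sample_probabilities=None, seed=None):
    """Pick sample indices; probabilities change how often each index is drawn."""
    rng = random.Random(seed)
    # weights=None draws uniformly; sampling is with replacement (assumption).
    return rng.choices(range(n), weights=sample_probabilities, k=batch_size)


def weighted_loss(losses, sample_weight):
    """Average per-sample losses after scaling each one by its sample weight."""
    return sum(loss * w for loss, w in zip(losses, sample_weight)) / len(losses)


# Probabilities affect which samples appear in a batch...
indices = draw_batch(n=3, batch_size=8, sample_probabilities=[0.0, 1.0, 0.0], seed=1)
# ...while sample weights rescale the loss of samples that were already chosen.
loss = weighted_loss([1.0, 2.0], sample_weight=[1.0, 0.5])
```

In this example only index 1 can be drawn, whereas the sample weights leave the batch composition untouched and only change the averaged loss.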


GenomicIntervalBatchGenerator

galaxy_ml.preprocessors._genomic_interval_batch_generator.GenomicIntervalBatchGenerator(ref_genome_path=None, intervals_path=None, target_path=None, features='infer', blacklist_regions='hg38', shuffle=True, seed=None, seq_length=1000, center_bin_to_predict=200, feature_thresholds=0.5, random_state=None)

Generate sequence arrays and target values from a reference genome, intervals, and a genomic feature dataset. Attempts to mimic selene_sdk.samplers.interval_sampler.IntervalsSampler.

Parameters

  • ref_genome_path: str
    File path to the reference genome, usually in fasta format.
  • intervals_path: str
    File path to the intervals dataset.
  • target_path: str
    File path to the dataset containing genomic features or target information, usually in bed or bed.gz format.
  • features: list of str or 'infer'
    A list of features to predict. If 'infer', retrieve all the unique features from the target file.
  • blacklist_regions: str
    E.g., 'hg38'. For more info, refer to selene_sdk.sequences.Genome.
  • shuffle: bool, default=True
    Whether to shuffle the data between epochs.
  • seed: int or None, default=None
    Random seed for shuffling between epochs.
  • seq_length: int, default=1000
    Retrieved sequence length.
  • center_bin_to_predict: int, default=200
    Query the tabix-indexed file for a region of length center_bin_to_predict.
  • feature_thresholds: float, default=0.5
    Threshold values to determine target value.
  • random_state: int or None, default=None
    Random seed for sampling sequences with changing position.
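
The relationship between seq_length, center_bin_to_predict and feature_thresholds can be sketched with simple interval arithmetic. This reflects a common reading of the selene-style sampler and is an assumption, not a description of the actual code: `windows_around` and `label_bin` are hypothetical helpers, and intervals are half-open (start, end) pairs.

```python
def windows_around(center, seq_length=1000, center_bin_to_predict=200):
    """Return (sequence_window, bin_window), both centered on the same position."""
    seq_start = center - seq_length // 2
    bin_start = center - center_bin_to_predict // 2
    return (
        (seq_start, seq_start + seq_length),
        (bin_start, bin_start + center_bin_to_predict),
    )


def label_bin(bin_window, feature_interval, feature_threshold=0.5):
    """Return 1 if the feature covers at least feature_threshold of the bin."""
    bin_start, bin_end = bin_window
    feat_start, feat_end = feature_interval
    overlap = max(0, min(bin_end, feat_end) - max(bin_start, feat_start))
    return int(overlap / (bin_end - bin_start) >= feature_threshold)


seq_window, bin_window = windows_around(center=5000)
label = label_bin(bin_window, feature_interval=(4950, 5200))
```

Under these assumptions the model sees the full 1000-base window, while the target is decided only by how much of the central 200-base bin the feature overlaps.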


GenomicVariantBatchGenerator

galaxy_ml.preprocessors._genomic_variant_batch_generator.GenomicVariantBatchGenerator(ref_genome_path=None, vcf_path=None, blacklist_regions='hg38', seq_length=1000, output_reference=False)

keras.utils.Sequence-compatible sequence array generator built from a reference genome and a VCF (variant call format) file.

Parameters

  • ref_genome_path: str
    File path to the reference genome, usually in fasta format.
  • vcf_path: str
    File path to the VCF dataset.
  • blacklist_regions: str
    E.g., 'hg38'. For more info, refer to selene_sdk.sequences.Genome.
  • seq_length: int, default=1000
    Retrieved sequence length.
  • output_reference: bool, default is False.
    If True, output reference sequence instead.
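
The effect of output_reference can be illustrated with a toy substitution. This sketch is an assumption about the general idea, not the generator's implementation: `variant_window` is a hypothetical helper, it uses a 0-based position (VCF positions are 1-based), and it only handles substitutions where ref and alt have the same length.

```python
def variant_window(chrom_seq, pos, ref, alt, seq_length=1000, output_reference=False):
    """Return a window around a variant, with or without the alternate allele."""
    if len(ref) != len(alt):
        raise ValueError("This sketch only supports same-length substitutions.")
    if chrom_seq[pos:pos + len(ref)] != ref:
        raise ValueError("Reference allele does not match the genome.")
    # Center the window on the variant, clipped at the chromosome start.
    start = max(0, pos - seq_length // 2)
    window = chrom_seq[start:start + seq_length]
    if output_reference:
        return window
    offset = pos - start
    return window[:offset] + alt + window[offset + len(ref):]


chrom = "AAAACAAAA"
alt_seq = variant_window(chrom, pos=4, ref="C", alt="T", seq_length=5)
ref_seq = variant_window(chrom, pos=4, ref="C", alt="T", seq_length=5, output_reference=True)
```

The two returned strings differ only at the variant position, which is what allows a model's predictions for the reference and alternate alleles to be compared.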


ImageDataFrameBatchGenerator

galaxy_ml.preprocessors._image_batch_generator.ImageDataFrameBatchGenerator(dataframe, featurewise_center=False, samplewise_center=False, featurewise_std_normalization=False, samplewise_std_normalization=False, zca_whitening=False, zca_epsilon=1e-06, rotation_range=0, width_shift_range=0.0, height_shift_range=0.0, brightness_range=None, shear_range=0.0, zoom_range=0.0, channel_shift_range=0.0, fill_mode='nearest', cval=0.0, horizontal_flip=False, vertical_flip=False, rescale=None, preprocessing_function=None, data_format='channels_last', interpolation_order=1, dtype='float32', directory=None, x_col='filename', y_col='class', weight_col=None, target_size=(256, 256), color_mode='rgb', classes=None, class_mode='categorical', shuffle=True, seed=None, save_to_dir=None, save_prefix='', save_format='png', interpolation='nearest', fit_sample_size=None)

Extend keras_preprocessing.image.ImageDataGenerator to work with DataFrame exclusively, generating batches of tensor data from images with online augmentation.

Parameters

From `keras_preprocessing.image.ImageDataGenerator`.
  • featurewise_center: Boolean.
    Set input mean to 0 over the dataset, feature-wise.
  • samplewise_center: Boolean. Set each sample mean to 0.
  • featurewise_std_normalization: Boolean.
    Divide inputs by std of the dataset, feature-wise.
  • samplewise_std_normalization: Boolean. Divide each input by its std.
  • zca_whitening: Boolean. Apply ZCA whitening.
  • zca_epsilon: epsilon for ZCA whitening. Default is 1e-6.
  • rotation_range: Int. Degree range for random rotations.
  • width_shift_range: Float, 1-D array-like or int.
  • height_shift_range: Float, 1-D array-like or int.
  • brightness_range: Tuple or list of two floats.
  • shear_range: Float. Shear Intensity.
  • zoom_range: Float or [lower, upper].
  • channel_shift_range: Float. Range for random channel shifts.
  • fill_mode: One of {"constant", "nearest", "reflect" or "wrap"}.
    Default is 'nearest'. Points outside the boundaries of the input are filled according to the given mode:
    - 'constant': kkkkkkkk|abcd|kkkkkkkk (cval=k)
    - 'nearest': aaaaaaaa|abcd|dddddddd
    - 'reflect': abcddcba|abcd|dcbaabcd
    - 'wrap': abcdabcd|abcd|abcdabcd
  • cval: Float or Int.
  • horizontal_flip: Boolean.
    Randomly flip inputs horizontally.
  • vertical_flip: Boolean.
  • rescale: rescaling factor. Defaults to None.
  • preprocessing_function: function that will be applied on each input.
    The function will run after the image is resized and augmented. The function should take one argument: one image (Numpy tensor with rank 3), and should output a Numpy tensor with the same shape.
  • data_format: Image data format,
    either "channels_first" or "channels_last". "channels_last" mode means that the images should have shape (samples, height, width, channels), "channels_first" mode means that the images should have shape (samples, channels, height, width). It defaults to the image_data_format value found in your Keras config file at ~/.keras/keras.json. If you never set it, then it will be "channels_last".
  • interpolation_order: Int.
  • dtype: Dtype to use for the generated arrays. Default is 'float32'.
  • dataframe: Pandas dataframe containing the filepaths relative to
    directory. From keras_preprocessing.image.ImageDataGenerator.flow_from_dataframe.
  • directory: string, path to the directory to read images from. If None,
    data in x_col column should be absolute paths.
  • x_col: string, column in dataframe that contains the filenames (or
    absolute paths if directory is None).
  • y_col: string or list, column/s in dataframe that has the target data.
  • weight_col: string, column in dataframe that contains the sample
    weights. Default: None.
  • target_size: tuple of integers (height, width), default: (256, 256).
    The dimensions to which all images found will be resized.
  • color_mode: one of "grayscale", "rgb", "rgba". Default: "rgb".
    Whether the images will be converted to have 1 or 3 color channels.
  • classes: optional list of classes (e.g. ['dogs', 'cats']).
    Default: None. If None, all classes in y_col will be used.
  • class_mode: one of "binary", "categorical", "input", "multi_output",
    "raw", "sparse" or None. Default: "categorical". Mode for yielding the targets:
    - "binary": 1D numpy array of binary labels,
    - "categorical": 2D numpy array of one-hot encoded labels. Supports multi-label output.
    - "input": images identical to input images (mainly used to work with autoencoders),
    - "multi_output": list with the values of the different columns,
    - "raw": numpy array of values in y_col column(s),
    - "sparse": 1D numpy array of integer labels,
    - None: no targets are returned (the generator will only yield batches of image data, which is useful in model.predict_generator()).
  • shuffle: whether to shuffle the data (default: True)
  • seed: optional random seed for shuffling and transformations.
  • save_to_dir: Optional directory where to save the pictures
    being yielded, in a viewable format. This is useful for visualizing the random transformations being applied, for debugging purposes.
  • save_prefix: String prefix to use for saving sample
    images (if save_to_dir is set).
  • save_format: Format to use for saving sample images
    (if save_to_dir is set).
  • interpolation: Interpolation method used to resample the image if the
    target size is different from that of the loaded image. Supported methods are "nearest", "bilinear", and "bicubic". If PIL version 1.1.3 or newer is installed, "lanczos" is also supported. If PIL version 3.4.0 or newer is installed, "box" and "hamming" are also supported. By default, "nearest" is used.
  • fit_sample_size: Int. Default is None / 1000.
    Number of training images used in datagen.fit. Relevant only when featurewise_center, featurewise_std_normalization or zca_whitening is set to True.
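
The fill_mode patterns listed for the parameter above can be reproduced in one dimension with a few lines of Python. This is only an illustration of the index rules behind each mode, not how keras_preprocessing fills image borders internally; `pad_1d` is a hypothetical helper that works on strings.

```python
def pad_1d(values, pad, mode="nearest", cval="k"):
    """Extend a string by `pad` characters on each side using a fill mode."""
    n = len(values)

    def pick(i):
        if 0 <= i < n:
            return values[i]
        if mode == "constant":
            return cval
        if mode == "nearest":
            # Clamp to the closest valid index.
            return values[min(max(i, 0), n - 1)]
        if mode == "wrap":
            return values[i % n]
        if mode == "reflect":
            # Mirror around the edges, repeating the edge value.
            j = i % (2 * n)
            return values[j if j < n else 2 * n - 1 - j]
        raise ValueError(f"Unknown fill_mode: {mode}")

    return "".join(pick(i) for i in range(-pad, n + pad))


modes = ("constant", "nearest", "reflect", "wrap")
results = {mode: pad_1d("abcd", 8, mode) for mode in modes}
```

Each entry of `results` is the corresponding pattern from the fill_mode description with the `|` separators removed.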