simdna package¶
simdna.simulations module¶
-
simdna.simulations.
get_distribution
(GC_fraction)¶
-
simdna.simulations.
motif_density
(motif_name, seq_length, num_seqs, min_counts, max_counts, GC_fraction, central_bp=None)¶ Returns sequences with motif density.
-
simdna.simulations.
simple_motif_embedding
(motif_name, seq_length, num_seqs, GC_fraction)¶ Returns a sequence array.
-
simdna.simulations.
simulate_differential_accessibility
(pos_motif_names, neg_motif_names, seq_length, min_num_motifs, max_num_motifs, num_pos, num_neg, GC_fraction)¶ Generates data for differential accessibility task.
Parameters: - pos_motif_names (list) – List of strings.
- neg_motif_names (list) – List of strings.
- seq_length (int) –
- min_num_motifs (int) –
- max_num_motifs (int) –
- num_pos (int) –
- num_neg (int) –
- GC_fraction (float) –
Returns: - sequence_arr (1darray) – Contains sequence strings.
- y (1darray) – Contains labels.
-
simdna.simulations.
simulate_heterodimer_grammar
(motif1, motif2, seq_length, min_spacing, max_spacing, num_pos, num_neg, GC_fraction)¶ - Simulates two classes of sequences with motif1 and motif2:
- Positive class sequences with motif1 and motif2 positioned between min_spacing and max_spacing apart
- Negative class sequences with independent motif1 and motif2 positioned
anywhere in the sequence, not as a heterodimer grammar
Parameters: - seq_length (int, length of sequence) –
- GC_fraction (float, GC fraction in background sequence) –
- num_pos (int, number of positive class sequences) –
- num_neg (int, number of negative class sequences) –
- motif1 (str, encode motif name) –
- motif2 (str, encode motif name) –
- min_spacing (int, minimum inter motif spacing) –
- max_spacing (int, maximum inter motif spacing) –
Returns: - sequence_arr (1darray) – Array with sequence strings.
- y (1darray) – Array with positive/negative class labels.
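The positive class described above can be sketched with the standard library alone. This is an illustrative sketch, not simdna's implementation: the helper names `random_background` and `embed_heterodimer` are hypothetical, and fixed strings stand in for the PWM samples that the real function draws from ENCODE motifs.

```python
import random


def random_background(seq_length, gc_fraction, rng):
    """Draw a background whose expected GC content is gc_fraction."""
    half_at = (1 - gc_fraction) / 2
    half_gc = gc_fraction / 2
    # Weights are ordered to match "ACGT".
    return rng.choices("ACGT", weights=[half_at, half_gc, half_gc, half_at],
                       k=seq_length)


def embed_heterodimer(background, site1, site2, min_spacing, max_spacing, rng):
    """Place site1, then site2 after a gap drawn from [min_spacing, max_spacing]."""
    spacing = rng.randint(min_spacing, max_spacing)
    total = len(site1) + spacing + len(site2)
    if total > len(background):
        raise ValueError("the motif pair does not fit in the background")
    start = rng.randint(0, len(background) - total)
    seq = list(background)
    seq[start:start + len(site1)] = site1
    second = start + len(site1) + spacing
    seq[second:second + len(site2)] = site2
    return "".join(seq), start, spacing


rng = random.Random(0)
background = random_background(50, 0.4, rng)
seq, start, spacing = embed_heterodimer(background, "GATAAG", "CACGTG", 2, 5, rng)
```

A negative-class sequence would place the two sites at independently sampled positions instead of a shared start plus spacing.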
-
simdna.simulations.
simulate_motif_counting
(motif_name, seq_length, pos_counts, neg_counts, num_pos, num_neg, GC_fraction)¶ Generates data for motif counting task.
Parameters: - motif_name (str) –
- seq_length (int) –
- pos_counts (list) – (min_counts, max_counts) for positive set.
- neg_counts (list) – (min_counts, max_counts) for negative set.
- num_pos (int) –
- num_neg (int) –
- GC_fraction (float) –
Returns: - sequence_arr (1darray) – Contains sequence strings.
- y (1darray) – Contains labels.
-
simdna.simulations.
simulate_motif_density_localization
(motif_name, seq_length, center_size, min_motif_counts, max_motif_counts, num_pos, num_neg, GC_fraction)¶ - Simulates two classes of sequences:
- Positive class sequences with multiple motif instances in center of the sequence.
- Negative class sequences with multiple motif instances anywhere in the sequence.
The number of motif instances is uniformly sampled between minimum and maximum motif counts.
Parameters: - motif_name (str) – encode motif name
- seq_length (int) – length of sequence
- center_size (int) – length of central part of the sequence where motifs can be positioned
- min_motif_counts (int) – minimum number of motif instances
- max_motif_counts (int) – maximum number of motif instances
- num_pos (int) – number of positive class sequences
- num_neg (int) – number of negative class sequences
- GC_fraction (float) – GC fraction in background sequence
Returns: - sequence_arr (1darray) – Contains sequence strings.
- y (1darray) – Contains labels.
-
simdna.simulations.
simulate_multi_motif_embedding
(motif_names, seq_length, min_num_motifs, max_num_motifs, num_seqs, GC_fraction)¶ Generates data for multi motif recognition task.
Parameters: - motif_names (list) – List of strings.
- seq_length (int) –
- min_num_motifs (int) –
- max_num_motifs (int) –
- num_seqs (int) –
- GC_fraction (float) –
Returns: - sequence_arr (1darray) – Contains sequence strings.
- y (ndarray) – Contains labels for each motif.
-
simdna.simulations.
simulate_single_motif_detection
(motif_name, seq_length, num_pos, num_neg, GC_fraction)¶ - Simulates two classes of sequences:
- Positive class sequence with a motif embedded anywhere in the sequence
- Negative class sequence without the motif
Parameters: - motif_name (str) – encode motif name
- seq_length (int) – length of sequence
- num_pos (int) – number of positive class sequences
- num_neg (int) – number of negative class sequences
- GC_fraction (float) – GC fraction in background sequence
Returns: - sequence_arr (1darray) – Array with sequence strings.
- y (1darray) – Array with positive/negative class labels.
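To make the return shape of the detection task concrete, here is a minimal standard-library sketch that builds a list of sequence strings and a parallel list of labels. It is an assumption-laden stand-in, not the library code: `simulate_single_site_detection` is a hypothetical name, it embeds a fixed string rather than a sample from a named ENCODE motif, and it returns plain lists rather than arrays.

```python
import random


def simulate_single_site_detection(site, seq_length, num_pos, num_neg,
                                   gc_fraction, seed=0):
    """Return (sequences, labels): positives contain `site`, negatives do not embed it."""
    rng = random.Random(seed)
    half_at = (1 - gc_fraction) / 2
    half_gc = gc_fraction / 2

    def background():
        # Weights are ordered to match "ACGT".
        return rng.choices("ACGT", weights=[half_at, half_gc, half_gc, half_at],
                           k=seq_length)

    sequences, labels = [], []
    for _ in range(num_pos):
        seq = background()
        start = rng.randint(0, seq_length - len(site))
        seq[start:start + len(site)] = site
        sequences.append("".join(seq))
        labels.append(1)
    for _ in range(num_neg):
        # Pure background; the site may still occur by chance.
        sequences.append("".join(background()))
        labels.append(0)
    return sequences, labels


sequences, labels = simulate_single_site_detection("TGACGTCA", 30, 3, 2, 0.4)
```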
simdna.synthetic module¶
-
class
simdna.synthetic.
AbstractApplySingleMutationFromSet
(setOfMutations, name=None)¶ Bases:
simdna.synthetic.AbstractTransformation
Class for applying a single mutation from a set of mutations; used to transform substrings generated by another method
Parameters: - setOfMutations – instance of
AbstractSetOfMutations
- name – see
DefaultNameMixin
.
-
getJsonableObject
()¶ See superclass.
-
selectMutation
()¶ Chooses a mutation from the set of mutations to apply.
Returns: an instance of Mutation
-
transform
(stringArr)¶ See superclass.
-
class
simdna.synthetic.
AbstractBackgroundGenerator
¶ Bases:
object
Returns the sequence that
AbstractEmbeddable
objects are to be embedded into.
-
generateBackground
()¶ Returns a sequence that is the background.
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractEmbeddable
¶ Bases:
object
Represents a thing which can be embedded.
An
AbstractEmbeddable
+ a position = an Embedding
-
canEmbed
(priorEmbeddedThings, startPos)¶ Checks whether embedding is possible at a given pos.
Accepts an instance of AbstractPriorEmbeddedThings and a startPos, and checks if startPos is viable given the contents of priorEmbeddedThings.
Parameters: - priorEmbeddedThings – instance of AbstractPriorEmbeddedThings
- startPos – int; the position you are considering embedding self at
Returns: A boolean indicating whether self can be embedded at startPos, given the things that have already been embedded.
-
embedInBackgroundStringArr
(priorEmbeddedThings, backgroundStringArr, startPos)¶ Embed self in a background string.
Will embed self at startPos in backgroundStringArr, and will update priorEmbeddedThings accordingly.
Parameters: - priorEmbeddedThings – instance of AbstractPriorEmbeddedThings
- backgroundStringArr – an array of characters representing the background
- startPos – integer; the position to embed self at
-
classmethod
fromString
(theString)¶ Generate an instance of the embeddable from the provided string.
-
getDescription
()¶ Return a concise description of the embeddable.
This should be concise and shouldn’t contain spaces. It will often be used when generating the __str__ representation of the embeddable.
-
-
class
simdna.synthetic.
AbstractEmbeddableGenerator
(name)¶ Bases:
simdna.synthetic.DefaultNameMixin
Generates an embeddable, usually for embedding in a background sequence.
-
generateEmbeddable
()¶ Generate an embeddable object.
Returns: An instance of AbstractEmbeddable
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractEmbedder
(name)¶ Bases:
simdna.synthetic.DefaultNameMixin
Produces
AbstractEmbeddable
objects and embeds them in a sequence.
-
embed
(backgroundStringArr, priorEmbeddedThings, additionalInfo=None)¶ Embeds things in the provided backgroundStringArr.
Modifies backgroundStringArr to include whatever has been embedded.
Parameters: - backgroundStringArr – array of characters representing the background string
- priorEmbeddedThings – instance of
AbstractPriorEmbeddedThings
- additionalInfo – instance of
AdditionalInfo
; allows the embedder to send back info about what it did
Returns: The modified backgroundStringArr
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractLoadedMotifs
(fileName, pseudocountProb=0.0, background=OrderedDict([('A', 0.27), ('C', 0.23), ('G', 0.23), ('T', 0.27)]))¶ Bases:
object
Class representing loaded PWMs.
A class that contains instances of pwm.PWM loaded from a file. The pwms can be accessed by name.
Parameters: - fileName – string, the path to the file to load
- pseudocountProb – if some of the pwms have 0 probability for some of the positions, will add the specified pseudocountProb to the rows of the pwm and renormalise.
- background – a dictionary with ACGT as the keys and the frequency as the values. Defaults to util.DEFAULT_BACKGROUND_FREQ
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
getPwm
(name)¶ Get a specific PWM.
Returns: The pwm.PWM
instance with the specified name.
-
getReadPwmAction
(recordedPwms)¶ Action performed when each line of the pwm text file is read in.
This function is to be overridden by a specific implementation. It is executed on each line of the file when it is read in, and when PWMs are ready they will get inserted into recordedPwms.
Parameters: recordedPwms – an OrderedDict that will be filled with PWMs. The keys will be the names of the PWMs and the values will be instances of pwm.PWM
-
class
simdna.synthetic.
AbstractPositionGenerator
(name)¶ Bases:
simdna.synthetic.DefaultNameMixin
Generate a start position at which to embed something
Given the length of the background sequence and the length of the substring you are trying to embed, will return a start position to embed the substring at.
-
generatePos
(lenBackground, lenSubstring, additionalInfo=None)¶ Generate the position to embed in.
Parameters: - lenBackground – int, length of background sequence
- lenSubstring – int, length of substring to embed
- additionalInfo – optional, instance of
AdditionalInfo
. Is used to leave a trace that this positionGenerator was called
Returns: An integer which is the start index to embed in.
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractPriorEmbeddedThings
¶ Bases:
object
Keeps track of what has already been embedded in a sequence.
-
addEmbedding
(startPos, what)¶ Records the embedding of an AbstractEmbeddable.
Embeds what from startPos to startPos+len(what). Creates an Embedding object.
Parameters: - startPos – int, the starting position at which to embed.
- what – instance of AbstractEmbeddable
-
canEmbed
(startPos, endPos)¶ Test whether startPos-endPos is available for embedding.
Parameters: - startPos – int, starting index
- endPos – int, ending index+1 (same semantics as array-slicing)
Returns: True if startPos:endPos is available for embedding
-
getEmbeddings
()¶ Returns: A collection of Embedding objects
-
getNumOccupiedPos
()¶ Returns: Number of positions that are filled with some kind of embedding
-
getTotalPos
()¶ Returns: Total number of positions (occupied and unoccupied) available to embed things in.
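The bookkeeping this interface describes can be illustrated with a plain Python list of booleans. The class below is a hypothetical sketch (`ListBackedOccupancy` and its snake_case methods are not part of simdna); it assumes the slice semantics stated for canEmbed and additionally treats out-of-range spans as unavailable.

```python
class ListBackedOccupancy:
    """Tracks which positions of a sequence already hold an embedding."""

    def __init__(self, seq_len):
        self.occupied = [False] * seq_len
        self.embeddings = []  # (start_pos, what) pairs, in insertion order

    def can_embed(self, start_pos, end_pos):
        # end_pos is exclusive, as in array slicing.
        if start_pos < 0 or end_pos > len(self.occupied):
            return False
        return not any(self.occupied[start_pos:end_pos])

    def add_embedding(self, start_pos, what):
        for i in range(start_pos, start_pos + len(what)):
            self.occupied[i] = True
        self.embeddings.append((start_pos, what))

    def num_occupied(self):
        return sum(self.occupied)

    def total_pos(self):
        return len(self.occupied)


occupancy = ListBackedOccupancy(10)
occupancy.add_embedding(2, "ACGT")  # marks positions 2..5 as occupied
```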
-
-
class
simdna.synthetic.
AbstractQuantityGenerator
(name)¶ Bases:
simdna.synthetic.DefaultNameMixin
Class for sampling values from a distribution.
-
generateQuantity
()¶ Sample a quantity from a distribution.
Returns: The sampled value.
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractSequenceSetGenerator
¶ Bases:
object
A generator for a collection of generated sequences.
-
generateSequences
()¶ The generator; implementation should have a yield.
Called as
generatedSequences = sequenceSetGenerator.generateSequences()
generateSequences
can then be iterated over.
Returns: A generator of GeneratedSequence objects
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractSetOfMutations
(mutationsArr)¶ Bases:
object
Represents a collection of
Mutation
objects.
Parameters: mutationsArr – array of Mutation objects
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
getMutationsArr
()¶ Returns
self.mutationsArr
Returns: self.mutationsArr
-
-
class
simdna.synthetic.
AbstractSingleSequenceGenerator
(namePrefix=None)¶ Bases:
object
Generate a single sequence.
Parameters: namePrefix – the GeneratedSequence object has a field for the object’s name; this is the prefix associated with that name. The suffix is the value of a counter that is incremented every time -
generateSequence
()¶ Generate the sequence.
Returns: An instance of GeneratedSequence
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractSubstringGenerator
(name)¶ Bases:
simdna.synthetic.DefaultNameMixin
Generates a substring, usually for embedding in a background sequence.
-
generateSubstring
()¶ Returns: A tuple of (string, stringDescription); the result can be wrapped in an instance of StringEmbeddable. stringDescription is a short descriptor that does not contain spaces and may be prefixed in front of string when generating the __str__ representation for StringEmbeddable.
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
-
class
simdna.synthetic.
AbstractTransformation
(name)¶ Bases:
simdna.synthetic.DefaultNameMixin
Class representing a transformation applied to a character array.
Takes an array of characters, applies some transformation.
-
getJsonableObject
()¶ Get JSON object representation.
Returns: A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.
-
transform
(stringArr)¶ Applies a transformation to stringArr.
Parameters: stringArr – an array of characters.
Returns: An array of characters that has the transformation applied. May mutate stringArr
-
-
class
simdna.synthetic.
AdditionalInfo
¶ Bases:
object
Used to keep track of which embedders/ops were called and how many times.
An instance of AdditionalInfo is meant to be an attribute of a GeneratedSequence object. It keeps track of things like embedders, position generators, etc.
Has self.trace, which is a dictionary from operatorName->int and which records operations that were called in the process of embedding things in the sequence. At the time of writing, operatorName is typically just the name of the embedder.
-
isInTrace
(operatorName)¶ Return True if operatorName has been called on the sequence.
-
updateAdditionalInfo
(operatorName, value)¶ Can be used to store any additional information on operatorName.
-
updateTrace
(operatorName)¶ Increment count for the number of times operatorName was called.
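The trace described above is essentially a counter keyed by operator name. The following sketch shows one way such a record could work and how 0/1 labels could be derived from it; `TraceRecorder` and `labels_from_trace` are hypothetical names used for illustration, not simdna classes.

```python
from collections import defaultdict


class TraceRecorder:
    """Counts how often each named operator was called on one sequence."""

    def __init__(self):
        self.trace = defaultdict(int)  # operator_name -> call count
        self.extra = {}                # operator_name -> arbitrary extra info

    def update_trace(self, operator_name):
        self.trace[operator_name] += 1

    def is_in_trace(self, operator_name):
        # Membership test does not create a key in the defaultdict.
        return operator_name in self.trace

    def update_additional_info(self, operator_name, value):
        self.extra[operator_name] = value


def labels_from_trace(info, label_names):
    """Return 1 for each label whose operator was called, else 0."""
    return [1 if info.is_in_trace(name) else 0 for name in label_names]


info = TraceRecorder()
info.update_trace("embedder_a")
info.update_trace("embedder_a")
labels = labels_from_trace(info, ["embedder_a", "embedder_b"])
```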
-
class
simdna.synthetic.
AllEmbedders
(embedders, name=None)¶ Bases:
simdna.synthetic.AbstractEmbedder
Wrapper around a list of embedders that calls each one in turn.
Useful to nest under a
RandomSubsetOfEmbedders
Parameters: embedders – an iterable of AbstractEmbedder objects.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
BernoulliQuantityGenerator
(prob, name=None)¶ Bases:
simdna.synthetic.AbstractQuantityGenerator
Generates 1 or 0 according to a Bernoulli distribution.
Parameters: prob – probability of 1
-
generateQuantity
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
BestHitPwm
(pwm, bestHitMode='pwmProb', name=None)¶ Bases:
simdna.synthetic.AbstractSubstringGenerator
Always return the best possible match to a pwm.PWM when called.
Parameters: - pwm – an instance of pwm.PWM
- bestHitMode – one of the values in pwm.BEST_HIT_MODE. If pwmProb, then the best match will be determined according to what is most likely to be sampled from the pwm matrix (this is the default). If logOdds, then the best match will be determined according to what would result in the best match according to the log-odds matrix (so, taking the background into account).
- name – see DefaultNameMixin
-
generateSubstring
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
BestHitPwmFromLoadedMotifs
(loadedMotifs, motifName, bestHitMode='pwmProb', name=None)¶ Bases:
simdna.synthetic.BestHitPwm
Instantiates BestHitPwm using a LoadedMotifs file. Analogous to PwmSamplerFromLoadedMotifs.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
ChooseMutationAtRandom
(setOfMutations, name=None)¶ Bases:
simdna.synthetic.AbstractApplySingleMutationFromSet
Selects a mutation at random from self.setOfMutations to apply.
-
selectMutation
()¶
-
-
class
simdna.synthetic.
ChooseValueFromASet
(setOfPossibleValues, name=None)¶ Bases:
simdna.synthetic.AbstractQuantityGenerator
Randomly samples a particular value from a set of values.
Parameters: - setOfPossibleValues – array of values that will be randomly sampled from.
- name – see DefaultNameMixin.
-
generateQuantity
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
DefaultNameMixin
(name)¶ Bases:
object
Basic functionality for classes that have a self.name attribute.
The self.name attribute is typically used to leave a trace in an instance of
AdditionalInfo
Parameters: name – string
-
getDefaultName
()¶
-
-
class
simdna.synthetic.
EmbedInABackground
(backgroundGenerator, embedders, namePrefix=None)¶ Bases:
simdna.synthetic.AbstractSingleSequenceGenerator
Generate a background sequence and embed smaller sequences in it.
Takes a backgroundGenerator and a series of embedders. Will generate the background and then call each of the embedders in succession. Then returns the result.
Parameters: - backgroundGenerator – instance of
AbstractBackgroundGenerator
- embedders – array of instances of
AbstractEmbedder
- namePrefix – see parent
-
generateSequence
()¶ Produce the sequence.
Generates a background using self.backgroundGenerator, splits it into an array, and passes it to each of self.embedders in turn for embedding things.
Returns: An instance of GeneratedSequence
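The flow described for generateSequence (make a background, split it into a character array, hand it to each embedder in turn) can be sketched with plain callables. Everything here is a simplified stand-in under stated assumptions: `generate_sequence` and `make_fixed_embedder` are hypothetical names, embedders are functions rather than AbstractEmbedder objects, and the result is a string rather than a GeneratedSequence.

```python
import random


def make_fixed_embedder(substring, rng):
    """Build an embedder that writes `substring` at a random unoccupied start."""
    def embed(background, occupied):
        n = len(substring)
        candidates = [s for s in range(len(background) - n + 1)
                      if not any(occupied[s:s + n])]
        if not candidates:
            return  # nothing fits; leave the background unchanged
        start = rng.choice(candidates)
        background[start:start + n] = substring
        occupied[start:start + n] = [True] * n
    return embed


def generate_sequence(background_generator, embedders):
    """Generate a background, then let each embedder modify it in order."""
    background = list(background_generator())  # split the string into an array
    occupied = [False] * len(background)
    for embed in embedders:
        embed(background, occupied)
    return "".join(background)


rng = random.Random(1)
seq = generate_sequence(
    lambda: "A" * 20,
    [make_fixed_embedder("CCCC", rng), make_fixed_embedder("GGG", rng)],
)
```

Sharing one occupancy record across embedders is what keeps later embeddings from overwriting earlier ones.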
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
EmbeddableEmbedder
(embeddableGenerator, positionGenerator=<simdna.synthetic.UniformPositionGenerator object>, name=None)¶ Bases:
simdna.synthetic.AbstractEmbedder
Embeds an instance of AbstractEmbeddable at a sampled pos.
Embeds instances of AbstractEmbeddable within the background sequence, at a position sampled from a distribution. Only embeds at unoccupied positions.
Parameters: - embeddableGenerator – instance of AbstractEmbeddableGenerator
- positionGenerator – instance of AbstractPositionGenerator
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
Embedding
(what, startPos)¶ Bases:
object
Represents something that has been embedded in a sequence.
Think of this as a combination of an embeddable + a start position.
Parameters: - what – object representing the thing that has been embedded. Should have __str__ and __len__ defined. Often is an instance of AbstractEmbeddable
- startPos – int, the position relative to the start of the parent sequence at which seq has been embedded
-
classmethod
fromString
(string, whatClass=None)¶ Recreate an Embedding object from a string.
Parameters: - string – assumed to have format: description[-|_]startPos[-|_]whatString, where whatString will be provided to whatClass
- whatClass – the class (usually an AbstractEmbeddable) that will be used to instantiate the what from the whatString
Returns: The Embedding class called with what=whatClass.fromString(whatString) and startPos=int(startPos)
-
class
simdna.synthetic.
FixedQuantityGenerator
(quantity, name=None)¶ Bases:
simdna.synthetic.AbstractQuantityGenerator
Returns a fixed number every time generateQuantity is called.
Parameters: quantity – the value to return when generateQuantity is called.
-
generateQuantity
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
FixedSubstringGenerator
(fixedSubstring, name=None)¶ Bases:
simdna.synthetic.AbstractSubstringGenerator
Generates the same string every time.
When generateSubstring() is called, always returns the same string. The string also serves as its own description
Parameters: - fixedSubstring – the string to be generated
- name – see
DefaultNameMixin
-
generateSubstring
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
GenerateSequenceNTimes
(singleSetGenerator, N)¶ Bases:
simdna.synthetic.AbstractSequenceSetGenerator
Call an AbstractSingleSequenceGenerator N times.
Parameters: - singleSetGenerator – an instance of AbstractSequenceSetGenerator
- N – integer, the number of times to call singleSetGenerator
-
generateSequences
()¶ A generator that calls self.singleSetGenerator N times.
Returns: a generator that will call self.singleSetGenerator N times.
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
GeneratedSequence
(seqName, seq, embeddings, additionalInfo)¶ Bases:
object
An object representing a sequence that has been generated.
Parameters: - seqName – string representing the name/id of the sequence
- seq – string representing the final generated sequence
- embeddings – an array of Embedding objects.
- additionalInfo – an instance of AdditionalInfo
-
class
simdna.synthetic.
InsideCentralBp
(centralBp, name=None)¶ Bases:
simdna.synthetic.AbstractPositionGenerator
For embedding within only the central region of a background.
Returns a position within the central region of a background sequence, sampled uniformly at random
Parameters: - centralBp – int, the number of bp, centered in the middle of the background, from which to sample the position. Is NOT +/- centralBp around the middle (is +/- centralBp/2 around the middle). If the background sequence is even and centralBp is odd, the shorter region will go on the left.
- name – string - see
DefaultNameMixin
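A worked sketch may help pin down the "centralBp wide, not +/- centralBp" rule. The function below is a hypothetical illustration, not the library's code: it assumes the whole substring must fit inside the central window, and it uses floor division for the window start, which is one possible reading of the odd/even tie-break described above.

```python
import random


def sample_inside_central_bp(len_background, len_substring, central_bp, rng=random):
    """Sample a start index so the substring lies inside the central window."""
    if central_bp > len_background:
        raise ValueError("central_bp is larger than the background")
    if len_substring > central_bp:
        raise ValueError("substring is larger than the central window")
    # The window spans central_bp positions in total, centred on the middle.
    region_start = (len_background - central_bp) // 2
    # Uniform over every start that keeps the substring within the window.
    return region_start + rng.randint(0, central_bp - len_substring)


rng = random.Random(0)
starts = [sample_inside_central_bp(100, 8, 20, rng) for _ in range(500)]
```

With a background of 100 and centralBp of 20, the window covers indices 40 to 59, so an 8 bp substring can start anywhere from 40 to 52.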
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
IsInTraceLabelGenerator
(labelNames)¶ Bases:
simdna.synthetic.LabelGenerator
LabelGenerator where labels match which embedders are called.
A special kind of LabelGenerator where the names of the labels are the names of embedders, and the label is 1 if a particular embedder has been called on the sequence and 0 otherwise.
-
class
simdna.synthetic.
LabelGenerator
(labelNames, labelsFromGeneratedSequenceFunction)¶ Bases:
object
Generate labels for a generated sequence.
Parameters: - labelNames – an array of strings that are the names of the labels
- labelsFromGeneratedSequenceFunction – function that accepts
an instance of
GeneratedSequence
and returns an array of the labels (eg: an array of ones and zeros indicating if the criteria for various labels are met)
-
generateLabels
(generatedSequence)¶ calls self.labelsFromGeneratedSequenceFunction.
Parameters: generatedSequence – an instance of GeneratedSequence
-
class
simdna.synthetic.
LoadedEncodeMotifs
(fileName, pseudocountProb=0.0, background=OrderedDict([('A', 0.27), ('C', 0.23), ('G', 0.23), ('T', 0.27)]))¶ Bases:
simdna.synthetic.AbstractLoadedMotifs
A class for reading in a motifs file in the ENCODE motifs format.
This class is specifically for reading files in the encode motif format - specifically the motifs.txt file that contains Pouya’s motifs (http://compbio.mit.edu/encode-motifs/motifs.txt)
Basically, the motif declarations start with a >, the first characters after > until the first space are taken as the motif name, the lines after the line with a > have the format: “<ignored character> <prob of A> <prob of C> <prob of G> <prob of T>”
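The file layout described above is simple enough to parse by hand. The sketch below follows that description only; `parse_encode_motifs` is a hypothetical helper rather than the loader simdna uses, it returns raw probability rows instead of pwm.PWM instances, and the sample lines are made up for illustration.

```python
def parse_encode_motifs(lines):
    """Parse motif blocks into {motif_name: [(pA, pC, pG, pT), ...]}."""
    motifs = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is everything after ">" up to the first whitespace.
            current = line[1:].split()[0]
            motifs[current] = []
        elif current is not None:
            fields = line.split()
            # fields[0] is the ignored character; the next four are A, C, G, T.
            motifs[current].append(tuple(float(x) for x in fields[1:5]))
    return motifs


sample_lines = [
    ">example_motif some free-text annotation",
    "C 0.1 0.7 0.1 0.1",
    "N 0.25 0.25 0.25 0.25",
]
motifs = parse_encode_motifs(sample_lines)
```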
-
getReadPwmAction
(recordedPwms)¶ See superclass.
-
-
class
simdna.synthetic.
MinMaxWrapper
(quantityGenerator, theMin=None, theMax=None, name=None)¶ Bases:
simdna.synthetic.AbstractQuantityGenerator
Compress a distribution to lie within a min and a max.
Wrapper that restricts a distribution to only return values between the min and the max. If a value outside the range is returned, resamples until it obtains a value within the range. Warns every time it tries to resample 10 times without successfully finding a value in the correct range.
Parameters: - quantityGenerator – instance of AbstractQuantityGenerator. Used to draw samples from the distribution to truncate.
- theMin – can be None; if so will be ignored.
- theMax – can be None; if so will be ignored.
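The resampling behaviour described above is a form of rejection sampling. Here is a minimal sketch of that idea, assuming inclusive bounds; `sample_within_bounds` and its `warn_every` parameter are hypothetical names, and the warning text is illustrative.

```python
import random
import warnings


def sample_within_bounds(draw, the_min=None, the_max=None, warn_every=10):
    """Call draw() until the value lies within the given (optional) bounds."""
    tries = 0
    while True:
        tries += 1
        value = draw()
        above_min = the_min is None or value >= the_min
        below_max = the_max is None or value <= the_max
        if above_min and below_max:
            return value
        if tries % warn_every == 0:
            warnings.warn(f"still out of range after {tries} samples")


rng = random.Random(0)
value = sample_within_bounds(lambda: rng.randint(0, 20), the_min=5, the_max=10)
```

If the bounds exclude most of the distribution, the loop can run for a long time, which is why a periodic warning is useful.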
-
generateQuantity
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
Mutation
(index, previous, new, parentLength=None)¶ Bases:
object
Represent a single bp mutation in a motif sequence.
Useful for creating simulations involving SNPs.
Parameters: - index – the position idx within the motif of the mutation
- previous – character, the previous base at this position
- new – character, the new base at this position after the mutation
- parentLength – optional; length of the motif. Used for assertion checks.
-
applyMutation
(stringArr)¶ Set the base at the position of the mutation to the mutated value.
Modifies stringArr which is an array of characters.
Parameters: stringArr – an array of characters, which gets modified.
-
parentLengthAssertionCheck
(stringArr)¶ Checks that stringArr is consistent with parentLength if defined.
-
revert
(stringArr)¶ Set the base at the position of the mutation to the unmutated value.
Modifies stringArr which is an array of characters.
Parameters: stringArr – an array of characters, which gets modified.
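A small sketch shows how applying and reverting a single-base mutation on a character array might look. `PointMutation` is a hypothetical class written for this example; the check that the current base equals `previous` before mutating is an assumption, not something the text above states.

```python
class PointMutation:
    """A single-base substitution at a fixed index within a motif."""

    def __init__(self, index, previous, new, parent_length=None):
        self.index = index
        self.previous = previous
        self.new = new
        self.parent_length = parent_length

    def _check_length(self, string_arr):
        if self.parent_length is not None and len(string_arr) != self.parent_length:
            raise ValueError("array length does not match parent_length")

    def apply(self, string_arr):
        """Mutate string_arr in place to carry the new base."""
        self._check_length(string_arr)
        if string_arr[self.index] != self.previous:
            raise ValueError("base at index is not the expected previous base")
        string_arr[self.index] = self.new

    def revert(self, string_arr):
        """Mutate string_arr in place back to the unmutated base."""
        self._check_length(string_arr)
        string_arr[self.index] = self.previous


motif = list("GATAAG")
snp = PointMutation(index=2, previous="T", new="C", parent_length=6)
snp.apply(motif)
mutated = "".join(motif)
snp.revert(motif)
restored = "".join(motif)
```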
-
class
simdna.synthetic.
OutsideCentralBp
(centralBp, name=None)¶ Bases:
simdna.synthetic.AbstractPositionGenerator
For embedding only OUTSIDE a central region of a background seq.
Returns a position OUTSIDE the central region of a background sequence, sampled uniformly at random. Complement of InsideCentralBp.
Parameters: centralBp – int, the centralBp to avoid embedding in. See the docs for InsideCentralBp for more details (this is the complement).
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
PairEmbeddable
(embeddable1, embeddable2, separation, embeddableDescription='', nothingInBetween=True)¶ Bases:
simdna.synthetic.AbstractEmbeddable
Embed two embeddables with some separation.
Parameters: - embeddable1 – instance of AbstractEmbeddable. First embeddable to be embedded. If a string is provided, will be wrapped in StringEmbeddable
- embeddable2 – second embeddable to be embedded. Type information similar to that of embeddable1
- separation – int of distance separating embeddable1 and embeddable2
- embeddableDescription – a concise descriptive string prefixed in front when generating a __str__ representation of the embeddable. Should not contain a hyphen.
- nothingInBetween – if true, then nothing else is allowed to be embedded in the gap between embeddable1 and embeddable2.
-
canEmbed
(priorEmbeddedThings, startPos)¶ See superclass.
-
embedInBackgroundStringArr
(priorEmbeddedThings, backgroundStringArr, startPos)¶ See superclass.
If
self.nothingInBetween
, then all the intervening positions between the two embeddables will be marked as occupied. Otherwise, only the positions occupied by the embeddables will be marked as occupied.
-
getDescription
()¶ See superclass.
-
class
simdna.synthetic.
PairEmbeddableGenerator
(embeddableGenerator1, embeddableGenerator2, separationGenerator, name=None)¶ Bases:
simdna.synthetic.AbstractEmbeddableGenerator
Embed a pair of embeddables with some separation.
Parameters: - embeddableGenerator1 – instance of AbstractEmbeddableGenerator. If an AbstractSubstringGenerator is provided, will be wrapped in an instance of SubstringEmbeddableGenerator.
- embeddableGenerator2 – same type information as for embeddableGenerator1
- separationGenerator – instance of AbstractQuantityGenerator
- name – string, see DefaultNameMixin
-
generateEmbeddable
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
PoissonQuantityGenerator
(mean, name=None)¶ Bases:
simdna.synthetic.AbstractQuantityGenerator
Generates values according to a Poisson distribution.
Parameters: mean – the mean of the Poisson distribution
-
generateQuantity
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
PriorEmbeddedThings_numpyArrayBacked
(seqLen)¶ Bases:
simdna.synthetic.AbstractPriorEmbeddedThings
A numpy-array based implementation of AbstractPriorEmbeddedThings.
Uses a numpy array where positions are set to 1 if they are occupied, to determine which positions are occupied and which are not. See superclass for more documentation.
Parameters: seqLen – integer indicating length of the sequence you are embedding in
-
addEmbedding
(startPos, what)¶ See superclass.
-
canEmbed
(startPos, endPos)¶ See superclass.
-
getEmbeddings
()¶ See superclass.
-
getNumOccupiedPos
()¶ See superclass.
-
getTotalPos
()¶ See superclass.
-
-
class
simdna.synthetic.
PwmSampler
(pwm, name=None)¶ Bases:
simdna.synthetic.AbstractSubstringGenerator
Samples from a pwm by calling
self.pwm.sampleFromPwm
Parameters: - pwm – an instance of
pwm.PWM
- name – see
DefaultNameMixin
-
generateSubstring
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
PwmSamplerFromLoadedMotifs
(loadedMotifs, motifName, name=None)¶ Bases:
simdna.synthetic.PwmSampler
Instantiates a PwmSampler from a LoadedEncodeMotifs file.
Convenience wrapper class for instantiating PwmSampler by pulling the pwm.PWM object using the provided name from an AbstractLoadedMotifs object.
Parameters: - loadedMotifs – instance of AbstractLoadedMotifs
- motifName – string, name of a motif in AbstractLoadedMotifs
- name – see DefaultNameMixin
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
RandomSubsetOfEmbedders
(quantityGenerator, embedders, name=None)¶ Bases:
simdna.synthetic.AbstractEmbedder
Call some random subset of supplied embedders.
Takes a quantity generator that generates a quantity of embedders, and executes that many embedders from a supplied set, in sequence
Parameters: - quantityGenerator – instance of
AbstractQuantityGenerator
- embedders – a list of
AbstractEmbedder
objects
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
RepeatedEmbedder
(embedder, quantityGenerator, name=None)¶ Bases:
simdna.synthetic.AbstractEmbedder
Call an embedder multiple times.
Wrapper around an embedder to call it multiple times according to samples from a distribution. First calls self.quantityGenerator to get the quantity, then calls self.embedder a number of times equal to the value returned.
Parameters: - embedder – instance of AbstractEmbedder
- quantityGenerator – instance of AbstractQuantityGenerator
-
getJsonableObject
()¶ See superclass.
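The two steps described for RepeatedEmbedder (draw a quantity, then call the wrapped embedder that many times) can be sketched as follows. This is a standalone illustration that does not use simdna; `repeated_embed`, `embed_fn`, `quantity_fn` and the toy embedder are hypothetical names.

```python
import random

def repeated_embed(embed_fn, quantity_fn, background):
    """Call embed_fn on background as many times as quantity_fn() returns."""
    count = quantity_fn()      # step 1: sample how many embeddings to make
    for _ in range(count):     # step 2: call the wrapped embedder that often
        embed_fn(background)
    return count

# Toy embedder: overwrite one random position with "N" (illustration only;
# unlike the real embedders it does not avoid occupied positions).
def mark_one_position(background):
    background[random.randrange(len(background))] = "N"

background = list("ACGTACGTACGT")
calls = repeated_embed(mark_one_position, lambda: 3, background)
print(calls, "".join(background))
```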
-
class
simdna.synthetic.
RepeatedSubstringBackgroundGenerator
(substringGenerator, repetitions)¶ Bases:
simdna.synthetic.AbstractBackgroundGenerator
Repeatedly call a substring generator and concatenate the result.
Can be used to generate variable-length sequences.
Parameters: - substringGenerator – instance of
AbstractSubstringGenerator
- repetitions – instance of
AbstractQuantityGenerator
. If an int is passed, a
FixedQuantityGenerator
is created from it. This will be called to determine the number of times to generate a substring from
self.substringGenerator
Returns: The concatenation of all the calls to
self.substringGenerator
-
generateBackground
()¶
-
getJsonableObject
()¶ See superclass.
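A minimal sketch of the idea behind RepeatedSubstringBackgroundGenerator, including the convenience of passing a plain int for the repetition count. It is written in plain Python and does not call simdna; `repeated_background`, `substring_fn` and the callable form of `repetitions` are hypothetical stand-ins for the generator objects.

```python
def repeated_background(substring_fn, repetitions):
    """Concatenate the results of calling substring_fn a given number of times.

    repetitions may be an int (fixed count) or a zero-argument callable
    that returns the count, mirroring the int-or-generator convenience.
    """
    if isinstance(repetitions, int):
        fixed = repetitions
        repetitions = lambda: fixed  # wrap the int as a fixed-quantity source
    return "".join(substring_fn() for _ in range(repetitions()))

print(repeated_background(lambda: "AC", 3))           # fixed count
print(repeated_background(lambda: "G", lambda: 2))    # count from a callable
```

Because the count is sampled on each call, a non-constant `repetitions` callable yields backgrounds of varying length.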
-
class
simdna.synthetic.
ReverseComplementWrapper
(substringGenerator, reverseComplementProb=0.5, name=None)¶ Bases:
simdna.synthetic.AbstractSubstringGenerator
Reverse complements a string with a specified probability.
Wrapper around an instance of
AbstractSubstringGenerator
that reverse complements the generated string with a specified probability.
Parameters: - substringGenerator – instance of
AbstractSubstringGenerator
- reverseComplementProb – probability of reverse complementation. Defaults to 0.5.
- name – see
DefaultNameMixin
.
-
generateSubstring
()¶
-
getJsonableObject
()¶ See superclass.
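Reverse complementation with a given probability can be illustrated with a few lines of standalone Python. This sketch does not use simdna; `reverse_complement` and `maybe_reverse_complement` are hypothetical helper names, and only the bases A, C, G and T (either case) are handled.

```python
import random

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def maybe_reverse_complement(seq, prob=0.5, rng=random):
    """Return the reverse complement with probability prob, else seq unchanged."""
    if rng.random() < prob:
        return reverse_complement(seq)
    return seq

print(reverse_complement("AACG"))
print(maybe_reverse_complement("AACG", prob=0.5))
```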
-
class
simdna.synthetic.
RevertToReference
(setOfMutations, name=None)¶ Bases:
simdna.synthetic.AbstractTransformation
For a series of mutations, reverts the supplied character to the reference (“unmutated”) string.
Parameters: - setOfMutations – instance of AbstractSetOfMutations
- name – see
DefaultNameMixin
.
-
getJsonableObject
()¶ See superclass.
-
transform
(stringArr)¶ See superclass.
-
class
simdna.synthetic.
SampleFromDiscreteDistributionSubstringGenerator
(discreteDistribution)¶ Bases:
simdna.synthetic.AbstractSubstringGenerator
Generate a substring by sampling from a distribution.
If the “substrings” are single characters (A/C/G/T), can be used in conjunction with
RepeatedSubstringBackgroundGenerator
to generate sequences with a certain GC content.
Parameters: discreteDistribution – instance of util.DiscreteDistribution
-
generateSubstring
()¶
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
StringEmbeddable
(string, stringDescription='')¶ Bases:
simdna.synthetic.AbstractEmbeddable
A string that is to be embedded in a background.
Represents a string (such as a sampling from a pwm) that is to be embedded in a background. See docs for superclass.
Parameters: - string – the core string to be embedded
- stringDescription – a short descriptor prefixed before the
__str__
representation of the embeddable. Should not contain a hyphen. Defaults to “”.
-
canEmbed
(priorEmbeddedThings, startPos)¶ See superclass.
-
embedInBackgroundStringArr
(priorEmbeddedThings, backgroundStringArr, startPos)¶ See superclass.
-
classmethod
fromString
(theString)¶ Generates a StringEmbeddable from the provided string.
Parameters: theString – string of the format stringDescription-coreString
. Will then return:
StringEmbeddable(string=coreString, stringDescription=stringDescription)
Returns: An instance of StringEmbeddable
-
getDescription
()¶ See superclass.
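The stringDescription-coreString format accepted by fromString can be parsed along the following lines. This is a sketch under stated assumptions, not the library's implementation: `parse_embeddable_string` is a hypothetical helper, the split is made on the last hyphen (reasonable because a DNA core string contains no hyphens), and a string without any hyphen is treated as a bare core string with an empty description.

```python
def parse_embeddable_string(the_string):
    """Split 'stringDescription-coreString' into (description, core_string).

    Assumption: the separator is the last hyphen in the string. If there is
    no hyphen, the whole input is taken as the core string.
    """
    description, _sep, core_string = the_string.rpartition("-")
    return description, core_string

print(parse_embeddable_string("CTCF_known1-ACGTTGCA"))
print(parse_embeddable_string("ACGTTGCA"))
```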
-
class
simdna.synthetic.
SubstringEmbeddableGenerator
(substringGenerator, name=None)¶ Bases:
simdna.synthetic.AbstractEmbeddableGenerator
Generates a
StringEmbeddable
Calls
substringGenerator
, wraps the result in a
StringEmbeddable
and returns it.
Parameters: substringGenerator – instance of AbstractSubstringGenerator
-
generateEmbeddable
()¶
-
getJsonableObject
()¶ See superclass.
-
-
class
simdna.synthetic.
SubstringEmbedder
(substringGenerator, positionGenerator=<simdna.synthetic.UniformPositionGenerator object>, name=None)¶ Bases:
simdna.synthetic.EmbeddableEmbedder
Used to embed substrings.
Embeds a single generated substring within the background sequence, at a position sampled from a distribution. Only embeds at unoccupied positions.
Parameters: - substringGenerator – instance of
AbstractSubstringGenerator
- positionGenerator – instance of
AbstractPositionGenerator
- name – see
DefaultNameMixin
.
-
class
simdna.synthetic.
TransformedSubstringGenerator
(substringGenerator, transformations, transformationsDescription='transformations', name=None)¶ Bases:
simdna.synthetic.AbstractSubstringGenerator
Generates a substring and applies a series of transformations.
Takes a substringGenerator and a set of AbstractTransformation objects, and applies the transformations to the generated substring.
Parameters: - substringGenerator – instance of
AbstractSubstringGenerator
- transformations – an iterable of
AbstractTransformation
- transformationsDescription – a string that will be prefixed in front of
substringDescription
(generated by
substringGenerator.generateSubstring()
) to produce the
stringDescription
.
- name – see
DefaultNameMixin
.
-
generateSubstring
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
UniformIntegerGenerator
(minVal, maxVal, name=None)¶ Bases:
simdna.synthetic.AbstractQuantityGenerator
Randomly samples an integer from minVal to maxVal, inclusive.
Parameters: - minVal – minimum integer that can be sampled
- maxVal – maximum integer that can be sampled
- name – See superclass.
-
generateQuantity
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
UniformPositionGenerator
(name=None)¶ Bases:
simdna.synthetic.AbstractPositionGenerator
Sample position uniformly at random.
Samples a start position to embed the substring in uniformly at random; does not return positions that are too close to the end of the background sequence to embed the full substring.
Parameters: name – string, see DefaultNameMixin
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
XOREmbedder
(embedder1, embedder2, probOfFirst, name=None)¶ Bases:
simdna.synthetic.AbstractEmbedder
Calls exactly one of the supplied embedders.
Parameters: - embedder1 – instance of
AbstractEmbedder
- embedder2 – instance of
AbstractEmbedder
- probOfFirst – probability of calling the first embedder
-
getJsonableObject
()¶ See superclass.
-
class
simdna.synthetic.
ZeroInflater
(quantityGenerator, zeroProb, name=None)¶ Bases:
simdna.synthetic.AbstractQuantityGenerator
Inflate a particular distribution with zeros.
Wrapper that inflates the number of zeros returned. With probability zeroProb it returns 0 outright; otherwise it samples from the wrapped distribution (which may still return 0).
Parameters: - quantityGenerator – an instance of
AbstractQuantityGenerator
; represents the distribution to sample from with probability
1-zeroProb
- zeroProb – the probability of just returning 0 without sampling from
quantityGenerator
- name – see
DefaultNameMixin
.
-
generateQuantity
()¶ See superclass.
-
getJsonableObject
()¶ See superclass.
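Zero inflation as described for ZeroInflater can be sketched with the standard library alone. The functions `uniform_integer` and `zero_inflated` are hypothetical stand-ins for the quantity generators, not simdna calls.

```python
import random

def uniform_integer(min_val, max_val, rng=random):
    """Sample an integer from min_val to max_val, both ends inclusive."""
    return rng.randint(min_val, max_val)

def zero_inflated(quantity_fn, zero_prob, rng=random):
    """Return 0 with probability zero_prob, else a sample from quantity_fn."""
    if rng.random() < zero_prob:
        return 0
    return quantity_fn()  # may itself return 0

# Half of the draws are forced to 0; the rest are uniform on 0..3.
samples = [zero_inflated(lambda: uniform_integer(0, 3), 0.5) for _ in range(10)]
print(samples)
```

If the wrapped distribution can return 0 on its own, the overall chance of a zero is zeroProb plus (1 - zeroProb) times the wrapped distribution's own probability of zero.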
-
class
simdna.synthetic.
ZeroOrderBackgroundGenerator
(seqLength, discreteDistribution=<simdna.util.DiscreteDistribution object>)¶ Bases:
simdna.synthetic.RepeatedSubstringBackgroundGenerator
Returns a sequence with a certain GC content.
Each base is sampled independently.
Parameters: - seqLength – int, length of the background
- discreteDistribution – instance of
util.DiscreteDistribution
, defaults to
util.DEFAULT_BASE_DISCRETE_DISTRIBUTION
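A zero-order background, where each base is drawn independently from a fixed distribution, can be sketched as below. This does not use simdna or its DiscreteDistribution class; `base_distribution` and `zero_order_background` are hypothetical helpers, the example GC fraction of 0.4 is arbitrary, and splitting the GC mass evenly between G and C (and the rest evenly between A and T) is an assumption of this illustration.

```python
import random

def base_distribution(gc_fraction):
    """Per-base probabilities for a given GC fraction (even split within pairs)."""
    gc = gc_fraction / 2.0
    at = (1.0 - gc_fraction) / 2.0
    return {"A": at, "C": gc, "G": gc, "T": at}

def zero_order_background(seq_length, gc_fraction, rng=random):
    """Draw seq_length bases independently from the base distribution."""
    dist = base_distribution(gc_fraction)
    bases = list(dist)
    weights = [dist[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=seq_length))

print(zero_order_background(50, gc_fraction=0.4))
```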
-
simdna.synthetic.
getEmbeddingsFromString
(string)¶ Get a series of
Embedding
objects from a string.
Splits the string on commas, and then passes the comma-separated values to
Embedding.fromString()
Parameters: string – The string to turn into an array of Embedding objects
Returns: an array of
Embedding
objects
-
simdna.synthetic.
printSequences
(outputFileName, sequenceSetGenerator, includeEmbeddings=False, labelGenerator=None, includeFasta=False, prefix=None)¶ Print a series of synthetic sequences.
Given an output filename and an instance of
AbstractSequenceSetGenerator
, will call the sequenceSetGenerator and print the generated sequences to the output file. Will also create a file “info_outputFileName.txt” in the same directory as outputFileName that contains all the information about sequenceSetGenerator.
Parameters: - outputFileName – string
- sequenceSetGenerator – instance of
AbstractSequenceSetGenerator
- includeEmbeddings – a boolean indicating whether to print a column that lists the embeddings
- labelGenerator – optional instance of
LabelGenerator
- includeFasta – optional boolean indicating whether to also print out the generated sequences in fasta format (the file will be produced with a .fa extension)
- prefix – string - this will be prefixed in front of the generated sequence ids, followed by a hyphen
-
simdna.synthetic.
sampleIndexWithinRegionOfLength
(length, lengthOfThingToEmbed)¶ Uniformly at random samples an integer from 0 to
length - lengthOfThingToEmbed
.
Parameters: - length – length of full region that could be embedded in
- lengthOfThingToEmbed – length of thing being embedded in larger region
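The sampling range can be made concrete with a short standalone sketch. `sample_start_index` is a hypothetical name, treating the upper bound as inclusive is an assumption, and raising an error when the item is longer than the region is a choice made for this illustration.

```python
import random

def sample_start_index(length, length_of_thing, rng=random):
    """Pick a start index so that the embedded item fits inside the region."""
    if length_of_thing > length:
        raise ValueError("item is longer than the region it should be embedded in")
    # Valid starts run from 0 up to length - length_of_thing (assumed inclusive),
    # so the item's last position never goes past the end of the region.
    return rng.randint(0, length - length_of_thing)

start = sample_start_index(10, 4)
print(start, start + 4 <= 10)
```

For a region of length 10 and an item of length 4, the valid start positions are 0 through 6.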