simdna package

simdna.simulations module

simdna.simulations.get_distribution(GC_fraction)
simdna.simulations.motif_density(motif_name, seq_length, num_seqs, min_counts, max_counts, GC_fraction, central_bp=None)

Returns sequences with motif density.

simdna.simulations.simple_motif_embedding(motif_name, seq_length, num_seqs, GC_fraction)

Returns an array of sequences.

simdna.simulations.simulate_differential_accessibility(pos_motif_names, neg_motif_names, seq_length, min_num_motifs, max_num_motifs, num_pos, num_neg, GC_fraction)

Generates data for differential accessibility task.

Parameters:
  • pos_motif_names (list) – List of strings.
  • neg_motif_names (list) – List of strings.
  • seq_length (int) –
  • min_num_motifs (int) –
  • max_num_motifs (int) –
  • num_pos (int) –
  • num_neg (int) –
  • GC_fraction (float) –
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (1darray) – Contains labels.

simdna.simulations.simulate_heterodimer_grammar(motif1, motif2, seq_length, min_spacing, max_spacing, num_pos, num_neg, GC_fraction)
Simulates two classes of sequences with motif1 and motif2:
  • Positive class sequences with motif1 and motif2 positioned between min_spacing and max_spacing apart
  • Negative class sequences with independent motif1 and motif2 positioned anywhere in the sequence, not as a heterodimer grammar

Parameters:
  • seq_length (int, length of sequence) –
  • GC_fraction (float, GC fraction in background sequence) –
  • num_pos (int, number of positive class sequences) –
  • num_neg (int, number of negative class sequences) –
  • motif1 (str, encode motif name) –
  • motif2 (str, encode motif name) –
  • min_spacing (int, minimum inter motif spacing) –
  • max_spacing (int, maximum inter motif spacing) –
Returns:

  • sequence_arr (1darray) – Array with sequence strings.
  • y (1darray) – Array with positive/negative class labels.
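
The two classes described above can be illustrated with a minimal, standard-library sketch. This is not the simdna implementation: fixed strings stand in for motifs sampled from a PWM, the helper names (random_background, positive_sequence, negative_sequence) are hypothetical, and the negative class here only avoids overlapping placements.

```python
import random

def random_background(length, gc_fraction, rng):
    """Return a list of bases with the given expected GC fraction."""
    at, gc = (1 - gc_fraction) / 2, gc_fraction / 2
    return rng.choices("ACGT", weights=[at, gc, gc, at], k=length)

def positive_sequence(motif1, motif2, seq_length, min_spacing, max_spacing,
                      gc_fraction, rng):
    """Embed motif1, then motif2 a sampled spacing downstream of it."""
    spacing = rng.randint(min_spacing, max_spacing)  # inclusive bounds
    span = len(motif1) + spacing + len(motif2)
    start = rng.randint(0, seq_length - span)
    seq = random_background(seq_length, gc_fraction, rng)
    seq[start:start + len(motif1)] = motif1
    second = start + len(motif1) + spacing
    seq[second:second + len(motif2)] = motif2
    return "".join(seq)

def negative_sequence(motif1, motif2, seq_length, gc_fraction, rng):
    """Embed both motifs at independent, non-overlapping positions."""
    seq = random_background(seq_length, gc_fraction, rng)
    while True:
        p1 = rng.randint(0, seq_length - len(motif1))
        p2 = rng.randint(0, seq_length - len(motif2))
        if p1 + len(motif1) <= p2 or p2 + len(motif2) <= p1:
            break
    seq[p1:p1 + len(motif1)] = motif1
    seq[p2:p2 + len(motif2)] = motif2
    return "".join(seq)

rng = random.Random(0)
sequence_arr = [positive_sequence("GATAAG", "CACGTG", 60, 2, 5, 0.4, rng)
                for _ in range(3)]
sequence_arr += [negative_sequence("GATAAG", "CACGTG", 60, 0.4, rng)
                 for _ in range(3)]
y = [1, 1, 1, 0, 0, 0]  # positive/negative class labels
```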

simdna.simulations.simulate_motif_counting(motif_name, seq_length, pos_counts, neg_counts, num_pos, num_neg, GC_fraction)

Generates data for motif counting task.

Parameters:
  • motif_name (str) –
  • seq_length (int) –
  • pos_counts (list) – (min_counts, max_counts) for positive set.
  • neg_counts (list) – (min_counts, max_counts) for negative set.
  • num_pos (int) –
  • num_neg (int) –
  • GC_fraction (float) –
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (1darray) – Contains labels.

simdna.simulations.simulate_motif_density_localization(motif_name, seq_length, center_size, min_motif_counts, max_motif_counts, num_pos, num_neg, GC_fraction)
Simulates two classes of sequences:
  • Positive class sequences with multiple motif instances in center of the sequence.
  • Negative class sequences with multiple motif instances anywhere in the sequence.

The number of motif instances is uniformly sampled between minimum and maximum motif counts.

Parameters:
  • motif_name (str) – encode motif name
  • seq_length (int) – length of sequence
  • center_size (int) – length of central part of the sequence where motifs can be positioned
  • min_motif_counts (int) – minimum number of motif instances
  • max_motif_counts (int) – maximum number of motif instances
  • num_pos (int) – number of positive class sequences
  • num_neg (int) – number of negative class sequences
  • GC_fraction (float) – GC fraction in background sequence
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (1darray) – Contains labels.

simdna.simulations.simulate_multi_motif_embedding(motif_names, seq_length, min_num_motifs, max_num_motifs, num_seqs, GC_fraction)

Generates data for multi motif recognition task.

Parameters:
  • motif_names (list) – List of strings.
  • seq_length (int) –
  • min_num_motifs (int) –
  • max_num_motifs (int) –
  • num_seqs (int) –
  • GC_fraction (float) –
Returns:

  • sequence_arr (1darray) – Contains sequence strings.
  • y (ndarray) – Contains labels for each motif.

simdna.simulations.simulate_single_motif_detection(motif_name, seq_length, num_pos, num_neg, GC_fraction)
Simulates two classes of sequences:
  • Positive class sequence with a motif embedded anywhere in the sequence
  • Negative class sequence without the motif
Parameters:
  • motif_name (str) – encode motif name
  • seq_length (int) – length of sequence
  • num_pos (int) – number of positive class sequences
  • num_neg (int) – number of negative class sequences
  • GC_fraction (float) – GC fraction in background sequence
Returns:

  • sequence_arr (1darray) – Array with sequence strings.
  • y (1darray) – Array with positive/negative class labels.

simdna.synthetic module

class simdna.synthetic.AbstractApplySingleMutationFromSet(setOfMutations, name=None)

Bases: simdna.synthetic.AbstractTransformation

Class for applying a single mutation from a set of mutations; used to transform substrings generated by another method

Parameters:
getJsonableObject()

See superclass.

selectMutation()

Chooses a mutation from the set of mutations to apply.

Returns:an instance of Mutation
transform(stringArr)

See superclass.

class simdna.synthetic.AbstractBackgroundGenerator

Bases: object

Returns the sequence that AbstractEmbeddable objects are to be embedded into.

generateBackground()

Returns a sequence that is the background.

getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractEmbeddable

Bases: object

Represents a thing which can be embedded.

An AbstractEmbeddable + a position = an Embedding

canEmbed(priorEmbeddedThings, startPos)

Checks whether embedding is possible at a given pos.

Accepts an instance of AbstractPriorEmbeddedThings and a startPos, and checks if startPos is viable given the contents of priorEmbeddedThings.

Parameters:
  • priorEmbeddedThings – instance of AbstractPriorEmbeddedThings
  • startPos – int; the position you are considering embedding self at
Returns:

A boolean indicating whether self can be embedded at startPos, given the things that have already been embedded.

embedInBackgroundStringArr(priorEmbeddedThings, backgroundStringArr, startPos)

Embed self in a background string.

Will embed self at startPos in backgroundStringArr, and will update priorEmbeddedThings accordingly.

Parameters:
  • priorEmbeddedThings – instance of AbstractPriorEmbeddedThings
  • backgroundStringArr – an array of characters representing the background
  • startPos – integer; the position to embed self at
classmethod fromString(theString)

Generate an instance of the embeddable from the provided string.

getDescription()

Return a concise description of the embeddable.

This should be concise and shouldn’t contain spaces. It will often be used when generating the __str__ representation of the embeddable.

class simdna.synthetic.AbstractEmbeddableGenerator(name)

Bases: simdna.synthetic.DefaultNameMixin

Generates an embeddable, usually for embedding in a background sequence.

generateEmbeddable()

Generate an embeddable object.

Returns:An instance of AbstractEmbeddable
getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractEmbedder(name)

Bases: simdna.synthetic.DefaultNameMixin

Produces AbstractEmbeddable objects and embeds them in a sequence.

embed(backgroundStringArr, priorEmbeddedThings, additionalInfo=None)

Embeds things in the provided backgroundStringArr.

Modifies backgroundStringArr to include whatever has been embedded.

Parameters:
  • backgroundStringArr – array of characters representing the background string
  • priorEmbeddedThings – instance of AbstractPriorEmbeddedThings
  • additionalInfo – instance of AdditionalInfo; allows the embedder to send back info about what it did
Returns:

The modified backgroundStringArr

getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractLoadedMotifs(fileName, pseudocountProb=0.0, background=OrderedDict([('A', 0.27), ('C', 0.23), ('G', 0.23), ('T', 0.27)]))

Bases: object

Class representing loaded PWMs.

A class that contains instances of pwm.PWM loaded from a file. The pwms can be accessed by name.

Parameters:
  • fileName – string, the path to the file to load
  • pseudocountProb – if some of the pwms have 0 probability for some of the positions, will add the specified pseudocountProb to the rows of the pwm and renormalise.
  • background – a dictionary with ACGT as the keys and the frequency as the values. Defaults to util.DEFAULT_BACKGROUND_FREQ
getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

getPwm(name)

Get a specific PWM.

Returns:The pwm.PWM instance with the specified name.
getReadPwmAction(recordedPwms)

Action performed when each line of the pwm text file is read in.

This function is to be overridden by a specific implementation. It is executed on each line of the file when it is read in, and when PWMs are ready they will get inserted into recordedPwms.

Parameters:recordedPwms – an OrderedDict that will be filled with PWMs. The keys will be the names of the PWMs and the values will be instances of pwm.PWM

class simdna.synthetic.AbstractPositionGenerator(name)

Bases: simdna.synthetic.DefaultNameMixin

Generate a start position at which to embed something

Given the length of the background sequence and the length of the substring you are trying to embed, will return a start position to embed the substring at.

generatePos(lenBackground, lenSubstring, additionalInfo=None)

Generate the position to embed in.

Parameters:
  • lenBackground – int, length of background sequence
  • lenSubstring – int, length of substring to embed
  • additionalInfo – optional, instance of AdditionalInfo. Is used to leave a trace that this positionGenerator was called
Returns:

An integer which is the start index to embed in.

getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractPriorEmbeddedThings

Bases: object

Keeps track of what has already been embedded in a sequence.

addEmbedding(startPos, what)

Records the embedding of an AbstractEmbeddable.

Embeds what from startPos to startPos+len(what). Creates an Embedding object.

Parameters:
  • startPos – int, the starting position at which to embed.
  • what – instance of AbstractEmbeddable
canEmbed(startPos, endPos)

Test whether startPos-endPos is available for embedding.

Parameters:
  • startPos – int, starting index
  • endPos – int, ending index+1 (same semantics as array-slicing)
Returns:

True if startPos:endPos is available for embedding

getEmbeddings()
Returns:A collection of Embedding objects
getNumOccupiedPos()
Returns:Number of positions that are filled with some kind of embedding
getTotalPos()
Returns:Total number of positions (occupied and unoccupied) available to embed things in.

class simdna.synthetic.AbstractQuantityGenerator(name)

Bases: simdna.synthetic.DefaultNameMixin

Class for sampling values from a distribution.

generateQuantity()

Sample a quantity from a distribution.

Returns:The sampled value.
getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractSequenceSetGenerator

Bases: object

A generator for a collection of generated sequences.

generateSequences()

The generator; implementation should have a yield.

Called as generatedSequences = sequenceSetGenerator.generateSequences()

generateSequences can then be iterated over.

Returns:A generator of GeneratedSequence objects
getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractSetOfMutations(mutationsArr)

Bases: object

Represents a collection of Mutation objects.

Parameters:mutationsArr – array of Mutation objects
getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

getMutationsArr()

Returns self.mutationsArr

Returns:self.mutationsArr
class simdna.synthetic.AbstractSingleSequenceGenerator(namePrefix=None)

Bases: object

Generate a single sequence.

Parameters:namePrefix – the GeneratedSequence object has a field for the object’s name; this is the prefix associated with that name. The suffix is the value of a counter that is incremented every time
generateSequence()

Generate the sequence.

Returns:An instance of GeneratedSequence
getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractSubstringGenerator(name)

Bases: simdna.synthetic.DefaultNameMixin

Generates a substring, usually for embedding in a background sequence.

generateSubstring()
Returns:A tuple of (string, stringDescription); the result can be wrapped in an instance of StringEmbeddable. stringDescription is a short descriptor that does not contain spaces and may be prefixed in front of string when generating the __str__ representation for StringEmbeddable.

getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

class simdna.synthetic.AbstractTransformation(name)

Bases: simdna.synthetic.DefaultNameMixin

Class representing a transformation applied to a character array.

Takes an array of characters, applies some transformation.

getJsonableObject()

Get JSON object representation.

Returns:A json-friendly object (built of dictionaries, lists and python primitives), which can be converted to json to record the exact details of what was simulated.

transform(stringArr)

Applies a transformation to stringArr.

Parameters:stringArr – an array of characters.
Returns:An array of characters that has the transformation applied.

May mutate stringArr

class simdna.synthetic.AdditionalInfo

Bases: object

Used to keep track of which embedders/ops were called and how many times.

An instance of AdditionalInfo is meant to be an attribute of a GeneratedSequence object. It keeps track of things like embedders, position generators, etc.

Has self.trace, which is a dictionary from operatorName->int and which records operations that were called in the process of embedding things in the sequence. At the time of writing, operatorName is typically just the name of the embedder.
isInTrace(operatorName)

Return True if operatorName has been called on the sequence.

updateAdditionalInfo(operatorName, value)

Can be used to store any additional information on operatorName.

updateTrace(operatorName)

Increment count for the number of times operatorName was called.
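
The bookkeeping described for AdditionalInfo can be pictured with the following sketch. It is a hypothetical stand-in written for illustration (the class name TraceSketch and the additionalInfo attribute are assumptions), not the simdna source.

```python
from collections import defaultdict

class TraceSketch:
    """Hypothetical stand-in for the call-count bookkeeping described above."""

    def __init__(self):
        self.trace = defaultdict(int)  # operatorName -> number of calls
        self.additionalInfo = {}       # operatorName -> arbitrary value

    def updateTrace(self, operatorName):
        # Increment the count for this operator.
        self.trace[operatorName] += 1

    def isInTrace(self, operatorName):
        # Membership tests do not create keys in a defaultdict.
        return operatorName in self.trace

    def updateAdditionalInfo(self, operatorName, value):
        self.additionalInfo[operatorName] = value

info = TraceSketch()
info.updateTrace("embedder_a")
info.updateTrace("embedder_a")
```

After the two calls, info.trace["embedder_a"] is 2 and info.isInTrace("embedder_b") is False.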

class simdna.synthetic.AllEmbedders(embedders, name=None)

Bases: simdna.synthetic.AbstractEmbedder

Wrapper around a list of embedders that calls each one in turn.

Useful to nest under a RandomSubsetOfEmbedders

Parameters:embedders – an iterable of AbstractEmbedder objects.
getJsonableObject()

See superclass.

class simdna.synthetic.BernoulliQuantityGenerator(prob, name=None)

Bases: simdna.synthetic.AbstractQuantityGenerator

Generates 1 or 0 according to a bernoulli distribution.

Parameters:prob – probability of 1
generateQuantity()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.BestHitPwm(pwm, bestHitMode='pwmProb', name=None)

Bases: simdna.synthetic.AbstractSubstringGenerator

Always return the best possible match to a pwm.PWM when called.

Parameters:
  • pwm – an instance of pwm.PWM
  • bestHitMode – one of the values in pwm.BEST_HIT_MODE. If pwmProb, then the best match will be determined according to what is most likely to be sampled from the pwm matrix (this is the default). If logOdds, then the best match will be determined according to what would result in the best match according to the log-odds matrix (so, taking the background into account).

generateSubstring()

See superclass.

getJsonableObject()

See superclass.
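
The difference between the two bestHitMode values can be seen in a small sketch. The PWM representation used here (a list of dicts mapping base to probability) and the function name best_hit are assumptions made for illustration; the background frequencies are the defaults quoted elsewhere in this reference.

```python
import math

DEFAULT_BACKGROUND = {"A": 0.27, "C": 0.23, "G": 0.23, "T": 0.27}

def best_hit(pwm_rows, best_hit_mode="pwmProb", background=DEFAULT_BACKGROUND):
    """Return the highest-scoring base at each position of the PWM."""
    def score(row, base):
        if best_hit_mode == "pwmProb":
            return row[base]
        if best_hit_mode == "logOdds":
            # log(0) is undefined, so zero-probability bases score -inf
            if row[base] <= 0:
                return float("-inf")
            return math.log(row[base] / background[base])
        raise ValueError("unknown bestHitMode: %r" % best_hit_mode)

    hit = []
    for row in pwm_rows:
        hit.append(max("ACGT", key=lambda base: score(row, base)))
    return "".join(hit)

# A position where A is the most probable base, but C is the most
# enriched relative to the background (0.28/0.23 > 0.30/0.27).
row = {"A": 0.30, "C": 0.28, "G": 0.21, "T": 0.21}
```

With this row, best_hit([row], "pwmProb") picks A, while best_hit([row], "logOdds") picks C.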

class simdna.synthetic.BestHitPwmFromLoadedMotifs(loadedMotifs, motifName, bestHitMode='pwmProb', name=None)

Bases: simdna.synthetic.BestHitPwm

Instantiates BestHitPwm using a LoadedMotifs file. Analogous to PwmSamplerFromLoadedMotifs.

getJsonableObject()

See superclass.

class simdna.synthetic.ChooseMutationAtRandom(setOfMutations, name=None)

Bases: simdna.synthetic.AbstractApplySingleMutationFromSet

Selects a mutation at random from self.setOfMutations to apply.

selectMutation()
class simdna.synthetic.ChooseValueFromASet(setOfPossibleValues, name=None)

Bases: simdna.synthetic.AbstractQuantityGenerator

Randomly samples a particular value from a set of values.

Parameters:
  • setOfPossibleValues – array of values that will be randomly sampled from.
  • name – see DefaultNameMixin.
generateQuantity()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.DefaultNameMixin(name)

Bases: object

Basic functionality for classes that have a self.name attribute.

The self.name attribute is typically used to leave a trace in an instance of AdditionalInfo

Parameters:name – string
getDefaultName()
class simdna.synthetic.EmbedInABackground(backgroundGenerator, embedders, namePrefix=None)

Bases: simdna.synthetic.AbstractSingleSequenceGenerator

Generate a background sequence and embed smaller sequences in it.

Takes a backgroundGenerator and a series of embedders. Will generate the background and then call each of the embedders in succession. Then returns the result.

Parameters:
generateSequence()

Produce the sequence.

Generates a background using self.backgroundGenerator, splits it into an array, and passes it to each of self.embedders in turn for embedding things.

Returns:An instance of GeneratedSequence
getJsonableObject()

See superclass.

class simdna.synthetic.EmbeddableEmbedder(embeddableGenerator, positionGenerator=<simdna.synthetic.UniformPositionGenerator object>, name=None)

Bases: simdna.synthetic.AbstractEmbedder

Embeds an instance of AbstractEmbeddable at a sampled pos.

Embeds instances of AbstractEmbeddable within the background sequence, at a position sampled from a distribution. Only embeds at unoccupied positions.

Parameters:
getJsonableObject()

See superclass.

class simdna.synthetic.Embedding(what, startPos)

Bases: object

Represents something that has been embedded in a sequence.

Think of this as a combination of an embeddable + a start position.

Parameters:
  • what – object representing the thing that has been embedded. Should have __str__ and __len__ defined. Often is an instance of AbstractEmbeddable
  • startPos – int, the position relative to the start of the parent sequence at which seq has been embedded
classmethod fromString(string, whatClass=None)

Recreate an Embedding object from a string.

Parameters:
  • string – assumed to have format: description[-|_]startPos[-|_]whatString, where whatString will be provided to whatClass
  • whatClass – the class (usually an AbstractEmbeddable) that will be used to instantiate the what from the whatString
Returns:

The Embedding class called with what=whatClass.fromString(whatString) and startPos=int(startPos)
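
As an illustration of the string format above, the three fields can be separated with a regular-expression split. This is a sketch under the assumption that description itself contains no hyphen or underscore; parse_embedding_string is a hypothetical helper, not a simdna function.

```python
import re

def parse_embedding_string(string):
    """Split 'description[-|_]startPos[-|_]whatString' into its fields."""
    # maxsplit=2 keeps any later hyphens or underscores inside whatString.
    description, start_pos, what_string = re.split(r"[-_]", string, maxsplit=2)
    return description, int(start_pos), what_string

fields = parse_embedding_string("pos-12_motifA-ACGT")
```

Here fields is ("pos", 12, "motifA-ACGT"); the last element is what would be handed to whatClass.fromString.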

class simdna.synthetic.FixedQuantityGenerator(quantity, name=None)

Bases: simdna.synthetic.AbstractQuantityGenerator

Returns a fixed number every time generateQuantity is called.

Parameters:quantity – the value to return when generateQuantity is called.
generateQuantity()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.FixedSubstringGenerator(fixedSubstring, name=None)

Bases: simdna.synthetic.AbstractSubstringGenerator

Generates the same string every time.

When generateSubstring() is called, always returns the same string. The string also serves as its own description

Parameters:
generateSubstring()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.GenerateSequenceNTimes(singleSetGenerator, N)

Bases: simdna.synthetic.AbstractSequenceSetGenerator

Call an AbstractSingleSequenceGenerator N times.

Parameters:
generateSequences()

A generator that calls self.singleSetGenerator N times.

Returns:a generator that will call self.singleSetGenerator N times.
getJsonableObject()

See superclass.
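
The generator pattern behind generateSequences can be sketched as below. The names generate_sequences and make_single_generator are hypothetical, and a (name, sequence) tuple stands in for a GeneratedSequence; the name is built from a prefix plus an incrementing counter, as described for namePrefix.

```python
import itertools

def make_single_generator(name_prefix):
    """Return a callable producing (name, sequence) pairs with a running counter."""
    counter = itertools.count()

    def generate_sequence():
        # Placeholder sequence; a real generator would build one here.
        return "%s%d" % (name_prefix, next(counter)), "ACGT"

    return generate_sequence

def generate_sequences(generate_sequence, n):
    """Lazily call the single-sequence generator n times."""
    for _ in range(n):
        yield generate_sequence()

generated = list(generate_sequences(make_single_generator("synth"), 3))
```

Because the function contains a yield, nothing is generated until the result is iterated over.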

class simdna.synthetic.GeneratedSequence(seqName, seq, embeddings, additionalInfo)

Bases: object

An object representing a sequence that has been generated.

Parameters:
  • seqName – string representing the name/id of the sequence
  • seq – string representing the final generated sequence
  • embeddings – an array of Embedding objects.
  • additionalInfo – an instance of AdditionalInfo
class simdna.synthetic.InsideCentralBp(centralBp, name=None)

Bases: simdna.synthetic.AbstractPositionGenerator

For embedding within only the central region of a background.

Returns a position within the central region of a background sequence, sampled uniformly at random.
Parameters:
  • centralBp – int, the number of bp, centered in the middle of the background, from which to sample the position. Is NOT +/- centralBp around the middle (is +/- centralBp/2 around the middle). If the background sequence is even and centralBp is odd, the shorter region will go on the left.
  • name – string - see DefaultNameMixin
getJsonableObject()

See superclass.
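
One way to realise the central-region sampling, consistent with the tie-breaking rule stated for centralBp, is sketched below. The start-of-region formula is an assumption chosen to match that rule, and sample_inside_central_bp is a hypothetical name, not the simdna method.

```python
import random

def sample_inside_central_bp(len_background, len_substring, central_bp,
                             rng=random):
    """Sample a start index so the substring lies within the central region."""
    if central_bp > len_background:
        raise ValueError("centralBp is longer than the background")
    if len_substring > central_bp:
        raise ValueError("substring is longer than centralBp")
    # Integer division puts the shorter half of an odd-length region
    # on the left when the background length is even.
    region_start = len_background // 2 - central_bp // 2
    # The last valid start keeps the whole substring inside the region.
    return region_start + rng.randint(0, central_bp - len_substring)
```

For a background of length 10 and centralBp of 3, the region covers indices 4 to 6, so a 3 bp substring can only start at index 4.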

class simdna.synthetic.IsInTraceLabelGenerator(labelNames)

Bases: simdna.synthetic.LabelGenerator

LabelGenerator where labels match which embedders are called.

A special kind of LabelGenerator where the names of the labels are the names of embedders, and the label is 1 if a particular embedder has been called on the sequence and 0 otherwise.
class simdna.synthetic.LabelGenerator(labelNames, labelsFromGeneratedSequenceFunction)

Bases: object

Generate labels for a generated sequence.

Parameters:
  • labelNames – an array of strings that are the names of the labels
  • labelsFromGeneratedSequenceFunction – function that accepts an instance of GeneratedSequence and returns an array of the labels (eg: an array of ones and zeros indicating if the criteria for various labels are met)
generateLabels(generatedSequence)

Calls self.labelsFromGeneratedSequenceFunction.

Parameters:generatedSequence – an instance of GeneratedSequence
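
The labelling rule used by IsInTraceLabelGenerator (1 if the named embedder was called, 0 otherwise) amounts to a membership test. In this sketch a plain dict stands in for the trace of a generated sequence, and labels_from_trace is a hypothetical name.

```python
def labels_from_trace(label_names, trace):
    """Return one 0/1 label per name, depending on whether it is in the trace."""
    return [1 if name in trace else 0 for name in label_names]

# Hypothetical trace: embedder name -> number of times it was called.
trace = {"embedder_a": 2}
labels = labels_from_trace(["embedder_a", "embedder_b"], trace)
```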
class simdna.synthetic.LoadedEncodeMotifs(fileName, pseudocountProb=0.0, background=OrderedDict([('A', 0.27), ('C', 0.23), ('G', 0.23), ('T', 0.27)]))

Bases: simdna.synthetic.AbstractLoadedMotifs

A class for reading in a motifs file in the ENCODE motifs format.

This class is specifically for reading files in the encode motif format - specifically the motifs.txt file that contains Pouya’s motifs (http://compbio.mit.edu/encode-motifs/motifs.txt)

Each motif declaration starts with a > line; the characters after > up to the first space are taken as the motif name. The lines that follow have the format: “<ignored character> <prob of A> <prob of C> <prob of G> <prob of T>”

getReadPwmAction(recordedPwms)

See superclass.
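
The file layout described above can be parsed along the following lines. This sketch returns plain lists rather than pwm.PWM instances, parse_encode_motifs is a hypothetical helper, and the example lines are made up to show the format rather than taken from motifs.txt.

```python
from collections import OrderedDict

def parse_encode_motifs(lines):
    """Map motif name -> list of [prob A, prob C, prob G, prob T] rows."""
    motifs = OrderedDict()
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Name is everything after '>' up to the first space.
            name = line[1:].split(" ")[0]
            current = motifs.setdefault(name, [])
        elif current is not None:
            fields = line.split()
            # fields[0] is the ignored character; the next four are A, C, G, T.
            current.append([float(x) for x in fields[1:5]])
    return motifs

example_lines = [
    ">MOTIF_1 made-up description",
    "A 0.7 0.1 0.1 0.1",
    "C 0.1 0.7 0.1 0.1",
]
parsed = parse_encode_motifs(example_lines)
```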

class simdna.synthetic.MinMaxWrapper(quantityGenerator, theMin=None, theMax=None, name=None)

Bases: simdna.synthetic.AbstractQuantityGenerator

Compress a distribution to lie within a min and a max.

Wrapper that restricts a distribution to only return values between the min and the max. If a value outside the range is returned, resamples until it obtains a value within the range. Warns every time it tries to resample 10 times without successfully finding a value in the correct range.

Parameters:
  • quantityGenerator – instance of AbstractQuantityGenerator. Used to draw samples from the distribution to truncate
  • theMin – can be None; if so will be ignored.
  • theMax – can be None; if so will be ignored.

generateQuantity()

See superclass.

getJsonableObject()

See superclass.
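
The resampling behaviour can be sketched as a simple rejection loop. The function name sample_within_range is hypothetical, and treating theMin and theMax as inclusive bounds is an assumption; the text above only says values lie between them.

```python
import random
import warnings

def sample_within_range(draw, the_min=None, the_max=None, warn_every=10):
    """Call draw() until the value lies within [the_min, the_max]."""
    tries = 0
    while True:
        value = draw()
        tries += 1
        above_min = the_min is None or value >= the_min
        below_max = the_max is None or value <= the_max
        if above_min and below_max:
            return value
        if tries % warn_every == 0:
            warnings.warn("%d draws without a value in range" % tries)

rng = random.Random(1)
samples = [sample_within_range(lambda: rng.randint(0, 20), 5, 10)
           for _ in range(50)]
```

A bound of None is ignored, so calling the function with no bounds returns the first draw unchanged.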

class simdna.synthetic.Mutation(index, previous, new, parentLength=None)

Bases: object

Represent a single bp mutation in a motif sequence.

Useful for creating simulations involving SNPs.

Parameters:
  • index – the position idx within the motif of the mutation
  • previous – character, the previous base at this position
  • new – character, the new base at this position after the mutation
  • parentLength – optional; length of the motif. Used for assertion checks.
applyMutation(stringArr)

Set the base at the position of the mutation to the mutated value.

Modifies stringArr which is an array of characters.

Parameters:stringArr – an array of characters, which gets modified.
parentLengthAssertionCheck(stringArr)

Checks that stringArr is consistent with parentLength if defined.

revert(stringArr)

Set the base at the position of the mutation to the unmutated value.

Modifies stringArr which is an array of characters.

Parameters:stringArr – an array of characters, which gets modified.
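
A minimal sketch of applying and reverting a single-base mutation on a character array follows. MutationSketch is a hypothetical stand-in; the check that the current base equals previous is an assumption added for the example and may not match simdna's own assertions.

```python
class MutationSketch:
    """Hypothetical stand-in for a single-bp mutation."""

    def __init__(self, index, previous, new):
        self.index = index
        self.previous = previous
        self.new = new

    def applyMutation(self, stringArr):
        # Assumed sanity check: the array should hold the unmutated base.
        if stringArr[self.index] != self.previous:
            raise ValueError("unexpected base at index %d" % self.index)
        stringArr[self.index] = self.new  # modifies stringArr in place

    def revert(self, stringArr):
        stringArr[self.index] = self.previous  # modifies stringArr in place

motif = list("GATAAG")
snp = MutationSketch(index=2, previous="T", new="C")
snp.applyMutation(motif)
mutated = "".join(motif)
snp.revert(motif)
restored = "".join(motif)
```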
class simdna.synthetic.OutsideCentralBp(centralBp, name=None)

Bases: simdna.synthetic.AbstractPositionGenerator

For embedding only OUTSIDE a central region of a background seq.

Returns a position OUTSIDE the central region of a background sequence, sampled uniformly at random. Complement of InsideCentralBp.
Parameters:centralBp – int, the centralBp to avoid embedding in. See the docs for InsideCentralBp for more details (this is the complement).
getJsonableObject()

See superclass.

class simdna.synthetic.PairEmbeddable(embeddable1, embeddable2, separation, embeddableDescription='', nothingInBetween=True)

Bases: simdna.synthetic.AbstractEmbeddable

Embed two embeddables with some separation.

Parameters:
  • embeddable1 – instance of AbstractEmbeddable. First embeddable to be embedded. If a string is provided, will be wrapped in StringEmbeddable
  • embeddable2 – second embeddable to be embedded. Type information similar to that of embeddable1
  • separation – int of distance separating embeddable1 and embeddable2
  • embeddableDescription – a concise descriptive string prefixed in front when generating a __str__ representation of the embeddable. Should not contain a hyphen.
  • nothingInBetween – if true, then nothing else is allowed to be embedded in the gap between embeddable1 and embeddable2.
canEmbed(priorEmbeddedThings, startPos)

See superclass.

embedInBackgroundStringArr(priorEmbeddedThings, backgroundStringArr, startPos)

See superclass.

If self.nothingInBetween, then all the intervening positions between the two embeddables will be marked as occupied. Otherwise, only the positions occupied by the embeddables will be marked as occupied.
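
The effect of nothingInBetween on which positions are marked as occupied can be sketched as follows. The sketch assumes that separation counts the bases between the end of embeddable1 and the start of embeddable2; occupied_positions is a hypothetical helper.

```python
def occupied_positions(start_pos, len1, separation, len2,
                       nothing_in_between=True):
    """Return the set of positions marked occupied by a pair embedding."""
    second_start = start_pos + len1 + separation
    if nothing_in_between:
        # The gap between the two embeddables is blocked as well.
        return set(range(start_pos, second_start + len2))
    first = set(range(start_pos, start_pos + len1))
    second = set(range(second_start, second_start + len2))
    return first | second

blocked = occupied_positions(2, 3, 2, 3, nothing_in_between=True)
open_gap = occupied_positions(2, 3, 2, 3, nothing_in_between=False)
```

With the gap left open, positions 5 and 6 remain available for other embeddings.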

getDescription()

See superclass.

class simdna.synthetic.PairEmbeddableGenerator(embeddableGenerator1, embeddableGenerator2, separationGenerator, name=None)

Bases: simdna.synthetic.AbstractEmbeddableGenerator

Embed a pair of embeddables with some separation.

Parameters:
  • embeddableGenerator1 – instance of AbstractEmbeddableGenerator. If an AbstractSubstringGenerator is provided, will be wrapped in an instance of SubstringEmbeddableGenerator.
  • embeddableGenerator2 – same type information as for embeddableGenerator1
  • separationGenerator – instance of AbstractQuantityGenerator
  • name – string, see DefaultNameMixin

generateEmbeddable()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.PoissonQuantityGenerator(mean, name=None)

Bases: simdna.synthetic.AbstractQuantityGenerator

Generates values according to a poisson distribution.

Parameters:mean – the mean of the poisson distribution
generateQuantity()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.PriorEmbeddedThings_numpyArrayBacked(seqLen)

Bases: simdna.synthetic.AbstractPriorEmbeddedThings

A numpy-array based implementation of AbstractPriorEmbeddedThings.

Uses a numpy array where positions are set to 1 if they are occupied, to determine which positions are occupied and which are not. See superclass for more documentation.

Parameters:seqLen – integer indicating length of the sequence you are embedding in
addEmbedding(startPos, what)

See superclass.

canEmbed(startPos, endPos)

See superclass.

getEmbeddings()

See superclass.

getNumOccupiedPos()

See superclass.

getTotalPos()

See superclass.
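
The occupancy bookkeeping can be illustrated with a list-backed sketch (a plain list is used instead of a numpy array so that the example needs only the standard library). OccupancyTracker and its method names are hypothetical; endPos is exclusive, as in array slicing.

```python
class OccupancyTracker:
    """List-backed stand-in for the occupancy mask described above."""

    def __init__(self, seq_len):
        self.occupied = [0] * seq_len  # 1 marks an occupied position
        self.embeddings = []           # (start_pos, what) pairs

    def can_embed(self, start_pos, end_pos):
        if start_pos < 0 or end_pos > len(self.occupied):
            return False
        return sum(self.occupied[start_pos:end_pos]) == 0

    def add_embedding(self, start_pos, what):
        end_pos = start_pos + len(what)
        if not self.can_embed(start_pos, end_pos):
            raise ValueError("positions %d:%d are not free" % (start_pos, end_pos))
        self.occupied[start_pos:end_pos] = [1] * len(what)
        self.embeddings.append((start_pos, what))

    def num_occupied(self):
        return sum(self.occupied)

tracker = OccupancyTracker(10)
tracker.add_embedding(2, "ACG")  # occupies positions 2, 3 and 4
```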

class simdna.synthetic.PwmSampler(pwm, name=None)

Bases: simdna.synthetic.AbstractSubstringGenerator

Samples from a pwm by calling self.pwm.sampleFromPwm

Parameters:
generateSubstring()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.PwmSamplerFromLoadedMotifs(loadedMotifs, motifName, name=None)

Bases: simdna.synthetic.PwmSampler

Instantiates a PwmSampler from a LoadedEncodeMotifs file.

Convenience wrapper class for instantiating PwmSampler by pulling the pwm.PWM object using the provided name from an AbstractLoadedMotifs object

Parameters:
getJsonableObject()

See superclass.

class simdna.synthetic.RandomSubsetOfEmbedders(quantityGenerator, embedders, name=None)

Bases: simdna.synthetic.AbstractEmbedder

Call some random subset of supplied embedders.

Takes a quantity generator that generates a quantity of embedders, and executes that many embedders from a supplied set, in sequence

Parameters:
getJsonableObject()

See superclass.

class simdna.synthetic.RepeatedEmbedder(embedder, quantityGenerator, name=None)

Bases: simdna.synthetic.AbstractEmbedder

Call an embedder multiple times.

Wrapper around an embedder to call it multiple times according to samples from a distribution. First calls self.quantityGenerator to get the quantity, then calls self.embedder a number of times equal to the value returned.

Parameters:
getJsonableObject()

See superclass.

class simdna.synthetic.RepeatedSubstringBackgroundGenerator(substringGenerator, repetitions)

Bases: simdna.synthetic.AbstractBackgroundGenerator

Repeatedly call a substring generator and concatenate the result.

Can be used to generate variable-length sequences.

Parameters:
Returns:

The concatenation of all the calls to self.substringGenerator

generateBackground()
getJsonableObject()

See superclass.

class simdna.synthetic.ReverseComplementWrapper(substringGenerator, reverseComplementProb=0.5, name=None)

Bases: simdna.synthetic.AbstractSubstringGenerator

Reverse complements a string with a specified probability.

Wrapper around an instance of AbstractSubstringGenerator that reverse complements the generated string with a specified probability.

Parameters:
  • substringGenerator – instance of AbstractSubstringGenerator
  • reverseComplementProb – probability of reverse complementation. Defaults to 0.5.
  • name – see DefaultNameMixin.
generateSubstring()
getJsonableObject()

See superclass.
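
Reverse complementation with a given probability can be sketched in a few lines. The helper names are hypothetical and the sketch handles only upper-case A/C/G/T.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Complement each base and reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def maybe_reverse_complement(seq, reverse_complement_prob=0.5, rng=random):
    """Reverse complement seq with the given probability."""
    if rng.random() < reverse_complement_prob:
        return reverse_complement(seq)
    return seq

rc = reverse_complement("GATAAG")
```

Here rc is "CTTATC"; a probability of 0 always returns the input unchanged and a probability of 1 always returns the reverse complement.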

class simdna.synthetic.RevertToReference(setOfMutations, name=None)

Bases: simdna.synthetic.AbstractTransformation

For a series of mutations, reverts the supplied character to the reference (“unmutated”) string.

Parameters:
  • setOfMutations – instance of AbstractSetOfMutations
  • name – see DefaultNameMixin.
getJsonableObject()

See superclass.

transform(stringArr)

See superclass.

class simdna.synthetic.SampleFromDiscreteDistributionSubstringGenerator(discreteDistribution)

Bases: simdna.synthetic.AbstractSubstringGenerator

Generate a substring by sampling from a distribution.

If the “substrings” are single characters (A/C/G/T), can be used in conjunction with RepeatedSubstringBackgroundGenerator to generate sequences with a certain GC content.

Parameters: discreteDistribution – instance of util.DiscreteDistribution
generateSubstring()
getJsonableObject()

See superclass.
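
The idea of sampling single bases from a discrete distribution and concatenating them to reach a target GC content can be sketched as follows. This uses only the Python standard library; `base_distribution` and `sample_background` are hypothetical names, and splitting the GC fraction evenly between G and C (and the remainder evenly between A and T) is an assumption.

```python
import random

def base_distribution(gc_fraction):
    """Map each base to its probability for a given GC fraction."""
    at = (1.0 - gc_fraction) / 2
    gc = gc_fraction / 2
    return {"A": at, "C": gc, "G": gc, "T": at}

def sample_background(length, gc_fraction, rng=random):
    """Sample each base independently and concatenate the results."""
    dist = base_distribution(gc_fraction)
    bases = list(dist.keys())
    weights = list(dist.values())
    return "".join(rng.choices(bases, weights=weights, k=length))

seq = sample_background(1000, 0.4, random.Random(0))
```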

class simdna.synthetic.StringEmbeddable(string, stringDescription='')

Bases: simdna.synthetic.AbstractEmbeddable

A string that is to be embedded in a background.

Represents a string (such as a sampling from a pwm) that is to be embedded in a background. See docs for superclass.

Parameters:
  • string – the core string to be embedded
  • stringDescription – a short descriptor prefixed before the __str__ representation of the embeddable. Should not contain a hyphen. Defaults to “”.
canEmbed(priorEmbeddedThings, startPos)

See superclass.

embedInBackgroundStringArr(priorEmbeddedThings, backgroundStringArr, startPos)

See superclass.

classmethod fromString(theString)

Generates a StringEmbeddable from the provided string.

Parameters: theString – string of the format stringDescription-coreString. Will then return: StringEmbeddable(string=coreString, stringDescription=stringDescription)
Returns: An instance of StringEmbeddable
getDescription()

See superclass.
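
The stringDescription-coreString format accepted by fromString could be parsed along the lines of the sketch below. This is not the simdna code: `parse_embeddable` is a hypothetical helper that returns a plain dict, it splits on the first hyphen (relying on the rule that the description should not contain one), and its handling of input without a hyphen is an assumption.

```python
def parse_embeddable(the_string):
    """Split 'description-core' into its two parts."""
    description, hyphen, core = the_string.partition("-")
    if not hyphen:
        # Assumption: with no hyphen, the whole input is the core string.
        return {"string": the_string, "stringDescription": ""}
    return {"string": core, "stringDescription": description}

parsed = parse_embeddable("CTCF-ACGT")
```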

class simdna.synthetic.SubstringEmbeddableGenerator(substringGenerator, name=None)

Bases: simdna.synthetic.AbstractEmbeddableGenerator

Generates a StringEmbeddable.

Calls substringGenerator, wraps the result in a StringEmbeddable and returns it.

Parameters: substringGenerator – instance of AbstractSubstringGenerator
generateEmbeddable()
getJsonableObject()

See superclass.

class simdna.synthetic.SubstringEmbedder(substringGenerator, positionGenerator=<simdna.synthetic.UniformPositionGenerator object>, name=None)

Bases: simdna.synthetic.EmbeddableEmbedder

Used to embed substrings.

Embeds a single generated substring within the background sequence, at a position sampled from a distribution. Only embeds at unoccupied positions.

Parameters:
class simdna.synthetic.TransformedSubstringGenerator(substringGenerator, transformations, transformationsDescription='transformations', name=None)

Bases: simdna.synthetic.AbstractSubstringGenerator

Generates a substring and applies a series of transformations.

Takes a substringGenerator and a set of AbstractTransformation objects, and applies the transformations to the generated substring.

Parameters:
  • substringGenerator – instance of AbstractSubstringGenerator
  • transformations – an iterable of AbstractTransformation
  • transformationsDescription – a string that will be prefixed in front of substringDescription (generated by substringGenerator.generateSubstring()) to produce the stringDescription.
  • name – see DefaultNameMixin.
generateSubstring()

See superclass.

getJsonableObject()

See superclass.
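
Applying a series of transformations to a generated substring amounts to feeding the output of each transformation into the next. A minimal standalone sketch, in which `apply_transformations` and `set_base` are hypothetical names and each transformation is modelled as a function on a list of characters:

```python
def apply_transformations(string_arr, transformations):
    """Apply each transformation in order and return the final character list."""
    result = list(string_arr)
    for transform in transformations:
        result = transform(result)
    return result

def set_base(position, base):
    """Build a transformation that overwrites one position (illustrative only)."""
    def transform(arr):
        arr = list(arr)        # copy so the input is left untouched
        arr[position] = base
        return arr
    return transform

mutated = "".join(apply_transformations("ACGT", [set_base(0, "T"), set_base(3, "A")]))
```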

class simdna.synthetic.UniformIntegerGenerator(minVal, maxVal, name=None)

Bases: simdna.synthetic.AbstractQuantityGenerator

Randomly samples an integer from minVal to maxVal, inclusive.

Parameters:
  • minVal – minimum integer that can be sampled
  • maxVal – maximum integer that can be sampled
  • name – See superclass.
generateQuantity()

See superclass.

getJsonableObject()

See superclass.

class simdna.synthetic.UniformPositionGenerator(name=None)

Bases: simdna.synthetic.AbstractPositionGenerator

Sample position uniformly at random.

Samples a start position to embed the substring in uniformly at random; does not return positions that are too close to the end of the background sequence to embed the full substring.

Parameters: name – string, see DefaultNameMixin
getJsonableObject()

See superclass.

class simdna.synthetic.XOREmbedder(embedder1, embedder2, probOfFirst, name=None)

Bases: simdna.synthetic.AbstractEmbedder

Calls exactly one of the supplied embedders.

Parameters:
getJsonableObject()

See superclass.
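
Calling exactly one of two embedders can be sketched as a single weighted coin flip. The code below is standalone and illustrative: `xor_embed` is a hypothetical name, and reading probOfFirst as the probability that the first embedder is the one called is an assumption based on the parameter name.

```python
import random

def xor_embed(embed_first, embed_second, prob_of_first, background, rng=random):
    """Call embed_first with probability prob_of_first, otherwise embed_second."""
    if rng.random() < prob_of_first:
        embed_first(background)
        return "first"
    embed_second(background)
    return "second"

# Hypothetical stand-in embedders that record which one ran.
log = []
picked = xor_embed(lambda bg: bg.append("e1"), lambda bg: bg.append("e2"), 1.0, log)
```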

class simdna.synthetic.ZeroInflater(quantityGenerator, zeroProb, name=None)

Bases: simdna.synthetic.AbstractQuantityGenerator

Inflate a particular distribution with zeros.

Wrapper that inflates the number of zeros returned. With probability zeroProb, returns zero; otherwise samples from the wrapped distribution (which may still return 0).

Parameters:
  • quantityGenerator – an instance of AbstractQuantityGenerator; represents the distribution to sample from with probability 1-zeroProb
  • zeroProb – the probability of just returning 0 without sampling from quantityGenerator
  • name – see DefaultNameMixin.
generateQuantity()

See superclass.

getJsonableObject()

See superclass.
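
Zero inflation as described for ZeroInflater can be sketched in a few lines. This is a standalone illustration rather than the simdna code; `zero_inflated` and `quantity_fn` are hypothetical names, and the wrapped distribution here is a uniform integer draw that includes both endpoints.

```python
import random

def zero_inflated(quantity_fn, zero_prob, rng=random):
    """Return 0 with probability zero_prob, else a sample from quantity_fn."""
    if rng.random() < zero_prob:
        return 0
    return quantity_fn()   # may itself return 0

rng = random.Random(0)
samples = [zero_inflated(lambda: rng.randint(1, 3), 0.5, rng) for _ in range(100)]
```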

class simdna.synthetic.ZeroOrderBackgroundGenerator(seqLength, discreteDistribution=<simdna.util.DiscreteDistribution object>)

Bases: simdna.synthetic.RepeatedSubstringBackgroundGenerator

Returns a sequence with a certain GC content.

Each base is sampled independently.

Parameters:
  • seqLength – int, length of the background
  • discreteDistribution – instance of util.DiscreteDistribution; defaults to util.DEFAULT_BASE_DISCRETE_DISTRIBUTION
simdna.synthetic.getEmbeddingsFromString(string)

Get a series of Embedding objects from a string.

Splits the string on commas, and then passes the comma-separated values to Embedding.fromString()

Parameters: string – The string to turn into an array of Embedding objects
Returns: an array of Embedding objects
simdna.synthetic.printSequences(outputFileName, sequenceSetGenerator, includeEmbeddings=False, labelGenerator=None, includeFasta=False, prefix=None)

Print a series of synthetic sequences.

Given an output filename, and an instance of AbstractSequenceSetGenerator, will call the sequenceSetGenerator and print the generated sequences to the output file. Will also create a file “info_outputFileName.txt” in the same directory as outputFileName that contains all the information about sequenceSetGenerator.
Parameters:
  • outputFileName – string
  • sequenceSetGenerator – instance of AbstractSequenceSetGenerator
  • includeEmbeddings – a boolean indicating whether to print a column that lists the embeddings
  • labelGenerator – optional instance of LabelGenerator
  • includeFasta – optional boolean indicating whether to also print out the generated sequences in fasta format (the file will be produced with a .fa extension)
  • prefix – string - this will be prefixed in front of the generated sequence ids, followed by a hyphen
simdna.synthetic.sampleIndexWithinRegionOfLength(length, lengthOfThingToEmbed)

Samples an integer uniformly at random from 0 to length - lengthOfThingToEmbed.

Parameters:
  • length – length of full region that could be embedded in
  • lengthOfThingToEmbed – length of thing being embedded in larger region
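
A start index sampled this way always leaves room for the full embedded item. The standalone sketch below shows the arithmetic; `sample_start_index` is a hypothetical name, and treating the upper bound length - lengthOfThingToEmbed as inclusive is an assumption.

```python
import random

def sample_start_index(length, length_of_thing, rng=random):
    """Sample a start index so the thing fits entirely inside the region."""
    if length_of_thing > length:
        raise ValueError("thing to embed is longer than the region")
    # Assumption: both endpoints (0 and length - length_of_thing) are possible.
    return rng.randint(0, length - length_of_thing)

rng = random.Random(0)
starts = [sample_start_index(100, 10, rng) for _ in range(200)]
```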

Module contents