trafpy.generator.src.dists package

Submodules

trafpy.generator.src.dists.node_dists module

Module for generating node distributions.

trafpy.generator.src.dists.node_dists.adjust_node_dist_for_rack_prob_config(rack_prob_config, eps, node_dist, print_data=False)[source]

Unlike the other adjust_node_dist_from_multinomial_exp_for_rack_prob_config function, this function does not use a multinomial experiment to adjust the prob dist, but rather uses a deterministic method of distorting the probabilities from the original node distribution such that the required inter-/intra-rack probabilities are met.
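
One deterministic way to meet a target inter-rack probability is to rescale the inter-rack and intra-rack pair probabilities separately. The sketch below illustrates that idea only; it is not TrafPy's implementation, and the function name rescale_for_rack_config, the tuple-keyed pair_probs dict and the ep_to_rack mapping are all assumptions made for the example.

```python
def rescale_for_rack_config(pair_probs, ep_to_rack, prob_inter_rack):
    """Rescale pair probabilities so inter-rack pairs sum to prob_inter_rack.

    Hypothetical helper: inter-rack and intra-rack probabilities are each
    scaled by a constant, so relative weights within each group are kept.
    """
    def is_inter(pair):
        src, dst = pair
        return ep_to_rack[src] != ep_to_rack[dst]

    inter_total = sum(p for pair, p in pair_probs.items() if is_inter(pair))
    intra_total = sum(p for pair, p in pair_probs.items() if not is_inter(pair))
    if inter_total == 0 or intra_total == 0:
        raise ValueError('need non-zero inter- and intra-rack probability mass')

    adjusted = {}
    for pair, p in pair_probs.items():
        if is_inter(pair):
            adjusted[pair] = p * prob_inter_rack / inter_total
        else:
            adjusted[pair] = p * (1 - prob_inter_rack) / intra_total
    return adjusted


# Assumed example data: two racks of two servers each.
ep_to_rack = {'server_0': 'rack_0', 'server_1': 'rack_0',
              'server_2': 'rack_1', 'server_3': 'rack_1'}
# Uniform starting distribution over the six unordered pairs.
pair_probs = {
    ('server_0', 'server_1'): 1 / 6,  # intra-rack
    ('server_2', 'server_3'): 1 / 6,  # intra-rack
    ('server_0', 'server_2'): 1 / 6,  # inter-rack
    ('server_0', 'server_3'): 1 / 6,  # inter-rack
    ('server_1', 'server_2'): 1 / 6,  # inter-rack
    ('server_1', 'server_3'): 1 / 6,  # inter-rack
}
adjusted = rescale_for_rack_config(pair_probs, ep_to_rack, prob_inter_rack=0.9)
inter_mass = sum(p for (src, dst), p in adjusted.items()
                 if ep_to_rack[src] != ep_to_rack[dst])
```

After rescaling, the inter-rack pairs carry 0.9 of the probability mass and the whole distribution still sums to 1.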

trafpy.generator.src.dists.node_dists.adjust_node_dist_from_multinomial_exp_for_rack_prob_config(rack_prob_config, eps, node_dist, num_exps_factor=2, print_data=False)[source]

Unlike the other adjust_node_dist_for_rack_prob_config function, this function adjusts the node dist by running multinomial experiments on the initial node distribution to sample from it. It is therefore much slower than the other function, especially for networks with more than 1,000 nodes.

Takes node dist and uses it to generate new node dist given inter- and intra-rack configuration.

Different DCNs have different inter and intra rack traffic. This function allows you to specify how much of your traffic should be inter and intra rack.

Parameters
  • rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or ‘racks’. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, server-server src-dst requests are assumed to be independent of which rack the servers are in. If specified, the dict should have a ‘racks_dict’ key, whose value is a dict mapping rack labels (e.g. ‘rack_0’, ‘rack_1’, etc.) to a list of the endpoints in the respective rack (e.g. [‘server_0’, ‘server_24’, ‘server_56’, …]), and a ‘prob_inter_rack’ key whose value is a float (e.g. 0.9) setting the probability that a chosen src endpoint has a destination outside its rack. To designate an entire rack as a ‘hot rack’ (many traffic requests originate from this rack), set skewed_nodes to the list of servers in this rack and configure rack_prob_config appropriately.

  • eps (list) – List of network node endpoints that can act as sources & destinations.

  • node_dist (numpy array) – 2D matrix array of source-destination pair probabilities of being chosen.

  • num_exps_factor (int) – Factor by which to multiply number of ep pairs to get the number of multinomial experiments to run when generating new node dist.

  • print_data (bool) – Whether or not to print extra information about the generated data.
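
As a minimal sketch of the structure described above, a rack_prob_config for a hypothetical four-server network might be assembled as follows. The server and rack labels are illustrative assumptions, and the ep_to_rack mapping is built here only to check that every endpoint belongs to a rack.

```python
# Hypothetical endpoint labels for a four-server network.
eps = ['server_0', 'server_1', 'server_2', 'server_3']

rack_prob_config = {
    # Rack label -> list of endpoints in that rack.
    'racks_dict': {
        'rack_0': ['server_0', 'server_1'],
        'rack_1': ['server_2', 'server_3'],
    },
    # Probability that a chosen src endpoint has a dst outside its rack.
    'prob_inter_rack': 0.9,
}

# Invert racks_dict into an endpoint -> rack lookup for sanity checks.
ep_to_rack = {
    ep: rack
    for rack, rack_eps in rack_prob_config['racks_dict'].items()
    for ep in rack_eps
}

# Every endpoint should appear in a rack, and the probability must be valid.
assert sorted(ep_to_rack) == sorted(eps)
assert 0.0 <= rack_prob_config['prob_inter_rack'] <= 1.0
```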

trafpy.generator.src.dists.node_dists.adjust_probability_array_sum(probs, target_sum=1, print_data=False)[source]

Adjusts the values of a probability array so that they sum to target_sum.

trafpy.generator.src.dists.node_dists.adjust_probability_dict_sum(probs, target_sum=1, print_data=False)[source]

Adjusts the values of a probability dict so that they sum to target_sum.
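
How TrafPy adjusts the values is not stated here. As an illustration of the general idea only, the sketch below rescales a dict of probabilities proportionally so that its values sum to a target; the helper name rescale_probs_to_sum and the proportional strategy are assumptions, not the library's method.

```python
def rescale_probs_to_sum(probs, target_sum=1.0):
    """Scale dict values proportionally so that they sum to target_sum."""
    total = sum(probs.values())
    if total <= 0:
        raise ValueError('probabilities must have a positive sum')
    return {key: value * target_sum / total for key, value in probs.items()}


probs = {'a': 0.2, 'b': 0.2, 'c': 0.4}  # sums to 0.8, not 1
rescaled = rescale_probs_to_sum(probs)
```

Proportional scaling preserves the ratios between entries, so 'c' remains twice as likely as 'a' after the adjustment.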

trafpy.generator.src.dists.node_dists.assign_matrix_to_probs(eps, node_dist)[source]

Assigns probabilities in 2D matrix to a src-dst pair prob dist dict.

trafpy.generator.src.dists.node_dists.assign_probs_to_matrix(eps, probs, matrix=None)[source]

Assigns probabilities to 2D matrix.

probs can be a list of pair probabilities or a dict mapping each pair to its probability.

N.B. If probs is a list, the probabilities are assumed to be given in the order in which the matrix indices are visited when looping over src in eps and, within that, over dst in eps.
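
The ordering convention for list input can be sketched as follows. This is an illustration using plain nested lists rather than a numpy array, and it takes the loop literally (including src == dst entries), which is an assumption about the real function.

```python
eps = ['server_0', 'server_1', 'server_2']
index_of = {ep: i for i, ep in enumerate(eps)}
n = len(eps)

# Nine example probabilities (0/36, 1/36, ..., 8/36) that sum to 1.
flat_probs = [k / 36 for k in range(n * n)]

# Fill the matrix in the same order as the nested src/dst loop.
matrix = [[0.0] * n for _ in range(n)]
remaining = iter(flat_probs)
for src in eps:
    for dst in eps:
        matrix[index_of[src]][index_of[dst]] = next(remaining)
```

Under this convention the probability for the pair (eps[i], eps[j]) sits at position i * len(eps) + j of the list.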

trafpy.generator.src.dists.node_dists.convert_pair_prob_dist_dict_to_matrix_pair_prob_dist_dict(pair_prob_dist, eps)[source]

Parameters

pair_prob_dist (dict) – Dict whose keys are node pairs and whose values are probabilities or fractions.

Returns

Dict whose keys are matrix indices of the node pairs and whose values are the pairs’ corresponding probabilities or fractions.

Return type

matrix_pair_prob_dist (dict)
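
The conversion from node-pair keys to matrix-index keys can be sketched as below. The tuple key format is an assumption made for the example; the key format TrafPy actually uses is not specified here.

```python
eps = ['server_0', 'server_1', 'server_2']
index_of = {ep: i for i, ep in enumerate(eps)}

# Assumed input format: (src, dst) tuples mapped to probabilities.
pair_prob_dist = {
    ('server_0', 'server_1'): 0.5,
    ('server_1', 'server_2'): 0.3,
    ('server_0', 'server_2'): 0.2,
}

# Replace each endpoint label with its row/column index in the matrix.
matrix_pair_prob_dist = {
    (index_of[src], index_of[dst]): prob
    for (src, dst), prob in pair_prob_dist.items()
}
```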

trafpy.generator.src.dists.node_dists.convert_sampled_pairs_into_node_dist(sampled_pairs, eps)[source]

trafpy.generator.src.dists.node_dists.gen_demand_nodes(eps, node_dist, size, axis, path_to_save=None, check_sum_valid=True)[source]

Generates demand nodes following the node_dist distribution

Parameters
  • eps (list) – List of node endpoint labels.

  • node_dist (numpy array) – Probability distribution with which each node is chosen.

  • size (int) – Number of demand nodes to generate

  • axis (int, 1 or 0) – Which axis of normalised node distribution to consider. E.g. If generating src nodes, axis=0. If dst nodes, axis=1

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • check_sum_valid (bool) – Whether or not to check that node dist sums to 1. Set to False if efficiency is a concern.

trafpy.generator.src.dists.node_dists.gen_multimodal_node_dist(eps, skewed_nodes=[], skewed_node_probs=[], num_skewed_nodes=None, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]

Generates a multimodal node distribution.

Generates a multimodal node demand distribution, i.e. certain nodes have a specified probability of being chosen. If no skewed nodes are given, a random number of nodes is randomly selected to skew. If no skew node probabilities are given, the probability with which to skew the node is randomly selected between 0.5 and 0.8. If num_skewed_nodes is not given, the number of nodes to skew is chosen randomly.

Parameters
  • eps (list) – List of network node endpoints that can act as sources & destinations

  • skewed_nodes (list) – Node(s) whose probability of being chosen you want to skew/specify.

  • skewed_node_probs (list) – Probability (or probabilities) of the skewed node(s) being chosen.

  • num_skewed_nodes (int) – Number of nodes to skew. If None, a number between 10% and 30% of the total number of nodes in the network is generated.

  • rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or ‘racks’. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, server-server src-dst requests are assumed to be independent of which rack the servers are in. If specified, the dict should have a ‘racks_dict’ key, whose value is a dict mapping rack labels (e.g. ‘rack_0’, ‘rack_1’, etc.) to a list of the endpoints in the respective rack (e.g. [‘server_0’, ‘server_24’, ‘server_56’, …]), and a ‘prob_inter_rack’ key whose value is a float (e.g. 0.9) setting the probability that a chosen src endpoint has a destination outside its rack. To designate an entire rack as a ‘hot rack’ (many traffic requests originate from this rack), set skewed_nodes to the list of servers in this rack and configure rack_prob_config appropriately.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • print_data (bool) – Whether or not to print extra information about the generated data.

Returns

Tuple containing:
  • node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.

  • fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple

trafpy.generator.src.dists.node_dists.gen_multimodal_node_pair_dist(eps, skewed_pairs=[], skewed_pair_probs=[], num_skewed_pairs=None, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]

Generates a multimodal node pair distribution.

Generates a multimodal node pair demand distribution, i.e. certain node pairs have a specified probability of being chosen. If no skewed pairs are given, the pairs to skew are selected randomly. If no skew pair probabilities are given, the probability with which to skew the pair is randomly selected between 0.1 and 0.3. If num_skewed_pairs is not given, the number of pairs to skew is chosen randomly.

Parameters
  • eps (list) – List of network node endpoints that can act as sources & destinations.

  • skewed_pairs (list of lists) – List of the node pairs [src,dst] to skew.

  • skewed_pair_probs (list) – Probabilities of node pairs being chosen.

  • num_skewed_pairs (int) – Number of pairs to randomly skew.

  • rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or ‘racks’. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, server-server src-dst requests are assumed to be independent of which rack the servers are in. If specified, the dict should have a ‘racks_dict’ key, whose value is a dict mapping rack labels (e.g. ‘rack_0’, ‘rack_1’, etc.) to a list of the endpoints in the respective rack (e.g. [‘server_0’, ‘server_24’, ‘server_56’, …]), and a ‘prob_inter_rack’ key whose value is a float (e.g. 0.9) setting the probability that a chosen src endpoint has a destination outside its rack. To designate an entire rack as a ‘hot rack’ (many traffic requests originate from this rack), set skewed_nodes to the list of servers in this rack and configure rack_prob_config appropriately.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • print_data (bool) – Whether or not to print extra information about the generated data.

Returns

Tuple containing:
  • node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.

  • fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple

trafpy.generator.src.dists.node_dists.gen_node_demands(eps, node_dist, num_demands, rack_prob_config=None, duplicate=False, path_to_save=None)[source]

Uses node distribution to generate src-dst node pair demands.

Parameters
  • eps (list) – List of network node endpoints that can act as sources & destinations.

  • node_dist (numpy array) – 2D matrix array of source-destination pair probabilities of being chosen.

  • num_demands (int) – Number of src-dst node pairs to generate.

  • duplicate (bool) – Whether or not to duplicate src-dst node pairs. Use this if the demands you’re generating have a ‘take down’ event as well as an ‘establish’ event.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

Returns

Tuple containing:
  • sn (numpy array): Selected source nodes.

  • dn (numpy array): Selected destination nodes.

Return type

tuple
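
Sampling src-dst pairs from a 2D node distribution can be sketched with the standard library alone. The code below is an illustration under stated assumptions rather than TrafPy's implementation: it uses nested lists in place of a numpy array, ignores rack_prob_config and duplicate, and the seed and example matrix are made up.

```python
import random

eps = ['server_0', 'server_1', 'server_2']
n = len(eps)

# Assumed example node distribution: zero diagonal, entries sum to 1.
node_dist = [
    [0.0, 0.1, 0.2],
    [0.1, 0.0, 0.2],
    [0.2, 0.2, 0.0],
]
num_demands = 1000

# Flatten the matrix into parallel lists of candidate pairs and weights.
pairs = [(src, dst) for src in eps for dst in eps]
weights = [node_dist[i][j] for i in range(n) for j in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sampled = rng.choices(pairs, weights=weights, k=num_demands)

# Split into source and destination sequences, mirroring the (sn, dn) return.
sn = [src for src, _ in sampled]
dn = [dst for _, dst in sampled]
```

Because the diagonal weights are zero, no sampled pair has the same source and destination.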

trafpy.generator.src.dists.node_dists.gen_uniform_multinomial_exp_node_dist(eps, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]

Runs multinomial exp with uniform initial probability to generate slight skew.

Runs a multinomial experiment where each node pair has same (uniform) probability of being chosen. Will generate a node demand distribution where a few pairs & nodes have a slight skew in demand

Parameters
  • eps (list) – List of network node endpoints that can act as sources & destinations

  • rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or ‘racks’. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, server-server src-dst requests are assumed to be independent of which rack the servers are in. If specified, the dict should have a ‘racks_dict’ key, whose value is a dict mapping rack labels (e.g. ‘rack_0’, ‘rack_1’, etc.) to a list of the endpoints in the respective rack (e.g. [‘server_0’, ‘server_24’, ‘server_56’, …]), and a ‘prob_inter_rack’ key whose value is a float (e.g. 0.9) setting the probability that a chosen src endpoint has a destination outside its rack. To designate an entire rack as a ‘hot rack’ (many traffic requests originate from this rack), set skewed_nodes to the list of servers in this rack and configure rack_prob_config appropriately.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • print_data (bool) – Whether or not to print extra information about the generated data.

Returns

Tuple containing:
  • node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.

  • fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple

trafpy.generator.src.dists.node_dists.gen_uniform_node_dist(eps, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]

Generates a uniform node distribution.

Parameters
  • eps (list) – List of network node endpoints that can act as sources & destinations

  • rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or ‘racks’. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, server-server src-dst requests are assumed to be independent of which rack the servers are in. If specified, the dict should have a ‘racks_dict’ key, whose value is a dict mapping rack labels (e.g. ‘rack_0’, ‘rack_1’, etc.) to a list of the endpoints in the respective rack (e.g. [‘server_0’, ‘server_24’, ‘server_56’, …]), and a ‘prob_inter_rack’ key whose value is a float (e.g. 0.9) setting the probability that a chosen src endpoint has a destination outside its rack. To designate an entire rack as a ‘hot rack’ (many traffic requests originate from this rack), set skewed_nodes to the list of servers in this rack and configure rack_prob_config appropriately.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • print_data (bool) – Whether or not to print extra information about the generated data.

Returns

Tuple containing:
  • node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.

  • fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple

trafpy.generator.src.dists.node_dists.get_inter_intra_rack_pair_prob_dicts(pair_prob_dict, ep_to_rack_dict)[source]

trafpy.generator.src.dists.node_dists.get_network_pair_mapper(eps)[source]

Gets dicts mapping network endpoint indices to and from node dist matrix.

trafpy.generator.src.dists.node_dists.get_pair_prob_dict_of_node_dist_matrix(node_dist, eps, all_combinations=False, bidirectional=False)[source]

Gets prob dict of each pair being chosen given node dist of probabilities.

If all_combinations, will record pair probabilities for all possible pair combinations i.e. src-dst and dst-src. If False, assumes src-dst==dst-src.

If bidirectional, will multiply probabilities by 2 as pair can be src-dst or dst-src. If bidirectional=True -> values sum to 1, if bidirectional=False -> values sum to 0.5.
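
The 0.5 versus 1 behaviour follows from reading only one triangle of a symmetric matrix. The sketch below illustrates that arithmetic; the helper name upper_triangle_pair_probs, the tuple keys and the example matrix are assumptions, not TrafPy's implementation.

```python
def upper_triangle_pair_probs(node_dist, eps, bidirectional=False):
    """Collect one probability per unordered pair from a symmetric matrix."""
    factor = 2 if bidirectional else 1
    n = len(eps)
    return {
        (eps[i], eps[j]): factor * node_dist[i][j]
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only: src-dst == dst-src
    }


eps = ['server_0', 'server_1', 'server_2']
# Assumed symmetric node distribution whose entries sum to 1.
node_dist = [
    [0.0, 0.1, 0.2],
    [0.1, 0.0, 0.2],
    [0.2, 0.2, 0.0],
]

one_way = upper_triangle_pair_probs(node_dist, eps)
two_way = upper_triangle_pair_probs(node_dist, eps, bidirectional=True)
```

Here the one_way values sum to 0.5 because half of the matrix's mass lies in the lower triangle, while doubling each entry in two_way restores a sum of 1.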

trafpy.generator.src.dists.node_dists.get_suitable_destination_node_for_rack_config(sn, node_dist, eps, ep_to_rack, rack_to_ep, inter_rack)[source]

Given source node, finds destination node given inter and intra rack config.

trafpy.generator.src.dists.plot_dists module

trafpy.generator.src.dists.val_dists module

Module for generating value distributions.

trafpy.generator.src.dists.val_dists.combine_multiple_mode_dists(data_dict, min_val, max_val, xlim=None, rand_var_name='Unknown', round_to_nearest=None, num_decimal_places=2)[source]

trafpy.generator.src.dists.val_dists.combine_skews(data_dict, min_val, max_val, bg_factor=0.5, xlim=None, logscale=False, transparent=True, rand_var_name='Unknown', num_bins=0, round_to_nearest=None, num_decimal_places=2)[source]

Combines multiple probability distributions for multimodal plotting.

Parameters
  • data_dict (dict) – Keys are mode iterations, values are random variable values for the mode iteration.

  • min_val (int/float) – Minimum random variable value.

  • max_val (int/float) – Maximum random variable value.

  • bg_factor (int/float) – Factor used to determine the amount of noise to add amongst the shaped modes being combined. A higher factor adds more noise to the distribution and makes the modes more connected; a lower factor reduces noise but makes the modes less connected.

  • xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • rand_var_name (str) – Name of random variable to label plot’s x-axis.

  • num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.

  • round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.

  • num_decimal_places (int) – Number of decimal places to round random variable values to. This must be stated explicitly, otherwise Python’s floating point arithmetic will cause spurious unique random variable values when discretising.

Returns

Probability distribution whose key-value pairs are random variable value-probability pairs.

Return type

dict

trafpy.generator.src.dists.val_dists.convert_data_to_key_occurrences(data)[source]

Converts random variable data into value keys and corresponding occurrences.

Parameters

data (list) – Random variables to convert into key-num_occurrences pairs.

Returns

Random variable value - number of occurrences key-value pairs generated from random variable data.

Return type

dict

trafpy.generator.src.dists.val_dists.convert_key_occurrences_to_data(keys, num_occurrences)[source]

Converts value keys and their number of occurrences into random vars.

Parameters
  • keys (list) – Random variable values.

  • num_occurrences (list) – Number of each random variable to generate.

Returns

Random variables generated.

Return type

list
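
The two conversions above are inverses of one another up to ordering. A minimal sketch of the round trip using collections.Counter is shown below; it illustrates the idea and is not TrafPy's implementation.

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3]

# Data -> {value: number of occurrences}
key_occurrences = dict(Counter(data))

# {value: number of occurrences} -> data, by repeating each key.
keys = list(key_occurrences)
num_occurrences = list(key_occurrences.values())
rebuilt = [key for key, count in zip(keys, num_occurrences) for _ in range(count)]
```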

trafpy.generator.src.dists.val_dists.gen_discrete_prob_dist(rand_vars, unique_vars=None, round_to_nearest=None, num_decimal_places=2, path_to_save=None)[source]

Generate discrete probability distribution from list of random variables.

Takes rand var values, rounds to nearest value (specified as arg, defaults by not rounding at all) to discretise the data, and generates a probability distribution for the data

Parameters
  • rand_vars (list) – Random variable values

  • unique_vars (list) – List of unique random variables that can occur. If given, will init each as having occurred 0 times and count number of times each occurred. If left as None, will only record probabilities of random variables that actually occurred in rand_vars.

  • round_to_nearest (int/float) – Value to round rand vars to nearest when discretising rand var values. E.g. if round_to_nearest=0.2, will round each rand var to nearest 0.2

  • num_decimal_places (int) – Number of decimal places for discretised rand vars. Need to explicitly state this because otherwise Python’s floating point arithmetic will cause spurious unique random var values

Returns

Tuple containing:
  • xk (list): List of (discretised) unique random variable values that occurred

  • pmf (list): List of corresponding probabilities that each unique value in xk occurs

Return type

tuple
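
The discretise-then-count procedure can be sketched as follows. The helper name discretise is hypothetical and the rounding rule is an assumption about how round_to_nearest is applied; the example also shows why the final rounding to num_decimal_places matters, since 3 * 0.2 evaluates to 0.6000000000000001 in floating point.

```python
from collections import Counter


def discretise(rand_vars, round_to_nearest=None, num_decimal_places=2):
    """Return (xk, pmf) for rand_vars after optional rounding."""
    if round_to_nearest is None:
        values = list(rand_vars)
    else:
        values = [
            # Snap to the nearest multiple, then tidy floating point noise.
            round(round(v / round_to_nearest) * round_to_nearest,
                  num_decimal_places)
            for v in rand_vars
        ]
    counts = Counter(values)
    xk = sorted(counts)
    pmf = [counts[x] / len(values) for x in xk]
    return xk, pmf


xk, pmf = discretise([0.11, 0.19, 0.32, 0.41, 0.58], round_to_nearest=0.2)
```

Without the outer rounding, 0.58 would map to 0.6000000000000001 rather than 0.6, creating a spurious extra unique value whenever another sample lands on exactly 0.6.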

trafpy.generator.src.dists.val_dists.gen_exponential_dist(_beta, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]

Generates an exponential distribution of random variable values.

The exponential distribution often fits scenarios whose events’ random variable values (e.g. ‘time between events’) are made of many small values (e.g. time intervals) and a few large values. Often used to predict time until next event occurs.

E.g. Real-world scenarios: Time between earthquakes, car accidents, mail delivery, and data centre demand arrival.

Parameters
  • _beta (int/float) – Mean random variable value.

  • size (int) – Number of random variable values to sample from distribution.

  • interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • transparent (bool) – Whether or not to make plot bins slightly transparent.

Returns

Random variable values generated by sampling from the distribution.

Return type

list
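
As a rough illustration of the _beta parameter, the standard library's random.expovariate can draw exponential samples with a given mean. This sketch does not call TrafPy; the seed and sample size are arbitrary assumptions.

```python
import random

beta = 10.0   # desired mean, playing the role of _beta
size = 20000

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# expovariate takes the rate lambda = 1 / mean.
samples = [rng.expovariate(1.0 / beta) for _ in range(size)]

sample_mean = sum(samples) / size
```

With this many samples the sample mean lands close to beta, and most values are small with a few large outliers, as described above.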

trafpy.generator.src.dists.val_dists.gen_lognormal_dist(_mu, _sigma, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]

Generates a log-normal distribution of random variable values.

Log-normal distributions often fit scenarios whose random variable values have a low mean value but a high degree of variance, leading to a distribution that is positively skewed (i.e. has a long tail to the right of its peak).

The log-normal distribution is closely related to the normal distribution: a log-normal random variable is normally distributed once its logarithm is taken. I.e. for a log-normally distributed random variable X, Y=ln(X) has a normal distribution.

E.g. of real-world scenarios: Length of a chess game, number of hospitalisations during an epidemic, the time after which a mechanical system needs repair, data centre demand interarrival times, etc.

Parameters
  • _mu (int/float) – Mean value of underlying normal distribution.

  • _sigma (int/float) – Standard deviation of underlying normal distribution.

  • size (int) – Number of random variable values to sample from distribution.

  • interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • transparent (bool) – Whether or not to make plot bins slightly transparent.

Returns

Random variable values generated by sampling from the distribution.

Return type

list
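
The relationship Y=ln(X) can be checked numerically with the standard library's random.lognormvariate. This is an independent sketch rather than a call into TrafPy; the parameter values, seed and sample size are assumptions.

```python
import math
import random

mu, sigma = 1.0, 0.5   # mean and std of the underlying normal distribution
size = 20000

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [rng.lognormvariate(mu, sigma) for _ in range(size)]

# Taking logs should recover an approximately normal sample with mean mu
# and standard deviation sigma.
logs = [math.log(x) for x in samples]
log_mean = sum(logs) / size
log_std = math.sqrt(sum((v - log_mean) ** 2 for v in logs) / size)
```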

trafpy.generator.src.dists.val_dists.gen_multimodal_val_dist(min_val, max_val, locations=[], skews=[], scales=[], num_skew_samples=[], bg_factor=0.5, round_to_nearest=None, num_decimal_places=2, occurrence_multiplier=10, path_to_save=None, plot_fig=False, show_fig=False, return_data=False, xlim=None, logscale=False, rand_var_name='Random Variable', prob_rand_var_less_than=None, num_bins=0, print_data=False)[source]

Generates a multimodal distribution of random variable values.

Multimodal distributions are arbitrary distributions with >= 2 different modes. A multimodal distribution with 2 modes is a special case called a ‘bimodal distribution’. Bimodal distributions are the most common multimodal distribution.

E.g. Real-world scenarios of bimodal distributions: Starting salaries for lawyers, book prices, peak restaurant hours, age groups of disease victims, packet sizes in data centre networks, etc.

Parameters
  • min_val (int/float) – Minimum random variable value.

  • max_val (int/float) – Maximum random variable value.

  • locations (list) – Position value(s) of skewed distribution(s) (mean shape parameter).

  • skews (list) – Skew value(s) of skewed distribution(s) (skewness shape parameter).

  • scales (list) – Scale value(s) of skewed distribution(s) (standard deviation shape parameter).

  • num_skew_samples (list) – Number(s) of random variables to sample from distribution(s) to generate skew data and plot.

  • bg_factor (int/float) – Factor used to determine the amount of noise to add amongst the shaped modes being combined. A higher factor adds more noise to the distribution and makes the modes more connected; a lower factor reduces noise but makes the modes less connected.

  • round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.

  • num_decimal_places (int) – Number of decimal places to round random variable values to. This must be stated explicitly, otherwise Python’s floating point arithmetic will cause spurious unique random variable values when discretising.

  • occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • return_data (bool) – Whether or not to return the random variable values sampled from the generated distribution.

  • xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • rand_var_name (str) – Name of random variable to label plot’s x-axis.

  • prob_rand_var_less_than (list) – List of values for which to print the probability that a variable sampled randomly from the generated distribution will be less than. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will return the probability that a randomly chosen variable is less than 3.7 and 5.8 respectively.

  • num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.

  • print_data (bool) – Whether or not to print extra information about the generated data.

Returns

Tuple containing:
  • prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.

  • rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.

  • fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple

trafpy.generator.src.dists.val_dists.gen_named_val_dist(dist, params=None, interactive_plot=False, size=150000, occurrence_multiplier=100, return_data=False, round_to_nearest=None, num_decimal_places=2, path_to_save=None, plot_fig=False, show_fig=False, min_val=None, max_val=None, xlim=None, logscale=False, rand_var_name='Random Variable', prob_rand_var_less_than=None, num_bins=0, print_data=False)[source]

Generates a ‘named’ (e.g. Weibull/exponential/log-normal/Pareto) distribution.

Parameters
  • dist (str) – One of the valid named distributions (e.g. ‘weibull’, ‘lognormal’, ‘pareto’, ‘exponential’)

  • params (dict) – Corresponding parameter arguments of distribution (e.g. for Weibull, params={‘_alpha’: 1.4, ‘_lambda’: 7000}). See the individual named distribution generator functions for more information.

  • interactive_plot (bool) – Whether or not you want to use the interactive functionality of this function in Jupyter notebook to visually shape your named distribution.

  • size (int) – Number of values to sample from generated distribution when generating random variable data.

  • round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.

  • num_decimal_places (int) – Number of decimal places to round random variable values to. This must be stated explicitly, otherwise Python’s floating point arithmetic will cause spurious unique random variable values when discretising.

  • occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • return_data (bool) – Whether or not to return the random variable values sampled from the generated distribution.

  • min_val (int/float) – Minimum random variable value.

  • max_val (int/float) – Maximum random variable value.

  • xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • rand_var_name (str) – Name of random variable to label plot’s x-axis.

  • prob_rand_var_less_than (list) – List of values for which to print the probability that a variable sampled randomly from the generated distribution will be less than. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will return the probability that a randomly chosen variable is less than 3.7 and 5.8 respectively.

  • num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.

  • print_data (bool) – Whether or not to print extra information about the generated data.

Returns

Tuple containing:
  • prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.

  • rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.

  • fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple

trafpy.generator.src.dists.val_dists.gen_normal_dist(loc, scale, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]

Generates a normal/Gaussian distribution of random variable values.

Parameters
  • size (int) – Number of random variable values to sample from distribution.

  • interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • transparent (bool) – Whether or not to make plot bins slightly transparent.

Returns

Random variable values generated by sampling from the distribution.

Return type

list

trafpy.generator.src.dists.val_dists.gen_pareto_dist(_alpha, _mode, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]

Generates a pareto distribution of random variable values.

Pareto distributions often fit scenarios whose random variable values have high probability of having a small range of values, leading to a distribution that is heavily skewed (i.e. has a long tail).

E.g. real-world scenarios: A large portion of society’s wealth being held by a small portion of its population, human settlement sizes, value of oil reserves in oil fields, size of sand particles, male dating success on Tinder, sizes of data centre demands, etc.

Parameters
  • _alpha (int/float) – Shape parameter of Pareto distribution. Describes how ‘stretched out’ (i.e. how high variance) the distribution is.

  • _mode (int/float) – Mode of the distribution, which is also the distribution’s minimum possible value.

  • size (int) – number of random variable values to sample from distribution.

  • interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • transparent (bool) – Whether or not to make plot bins slightly transparent.

Returns

random variable values generated by sampling from the distribution.

Return type

list
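
To make the role of _alpha and _mode concrete, here is a small standard-library sketch. It is an illustration under assumptions rather than TrafPy's code: the helper name sample_pareto_sketch and the seed argument are hypothetical. It relies on random.paretovariate, whose samples have a minimum of 1, so multiplying by the mode makes the mode the smallest possible value.

```python
import random


def sample_pareto_sketch(alpha, mode, size, seed=0):
    """Hypothetical helper (not part of TrafPy): Pareto samples with minimum `mode`."""
    rng = random.Random(seed)
    # random.paretovariate(alpha) returns values >= 1; scaling shifts the minimum to `mode`.
    return [mode * rng.paretovariate(alpha) for _ in range(size)]


samples = sample_pareto_sketch(alpha=1.5, mode=2.0, size=10000)
mean = sum(samples) / len(samples)
median = sorted(samples)[len(samples) // 2]
# The long tail pulls the mean well above the median.
print(min(samples) >= 2.0, mean > median)
```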

trafpy.generator.src.dists.val_dists.gen_rand_vars_from_discretised_dist(unique_vars, probabilities, num_demands, jensen_shannon_distance_threshold=None, show_fig=False, xlabel='Random Variable', font_size=20, figsize=(4, 3), marker_size=15, logscale=False, path_to_save=None)[source]

Generates random variable values by sampling from a discretised distribution.

Parameters
  • unique_vars (list) – Possible random variable values.

  • probabilities (list) – Corresponding probabilities of each random variable value being chosen.

  • num_demands (int) – Number of random variables to sample.

  • jensen_shannon_distance_threshold (float) – Maximum Jensen-Shannon distance allowed between the generated random variables and the discretised dist they are generated from. Must be between 0 and 1. A distance of 0 means the distributions are exactly the same; a distance of 1 means the distributions are not at all similar. See https://medium.com/datalab-log/measuring-the-statistical-similarity-between-two-samples-using-jensen-shannon-and-kullback-leibler-8d05af514b15 N.B. To meet threshold, this function will keep doubling num_demands

  • show_fig (bool) – Whether or not to generate a plot of the sampled variable distribution alongside the original distribution.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated data. E.g. path_to_save=’data/my_data’

Returns

Random variable values sampled from dist.

Return type

numpy array
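
The interplay between num_demands and jensen_shannon_distance_threshold can be sketched as follows. This is a standard-library illustration, not TrafPy's implementation: js_distance, sample_until_close, seed and max_doublings are hypothetical names, and the sketch assumes that the whole sample is redrawn each time num_demands is doubled. Base-2 logarithms are used so that the distance lies between 0 and 1.

```python
import math
import random
from collections import Counter


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two aligned probability lists."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding error


def sample_until_close(unique_vars, probabilities, num_demands, threshold,
                       seed=0, max_doublings=20):
    """Hypothetical helper: double num_demands until the sample matches the target dist."""
    rng = random.Random(seed)
    for _ in range(max_doublings + 1):
        samples = rng.choices(unique_vars, weights=probabilities, k=num_demands)
        counts = Counter(samples)
        empirical = [counts[v] / num_demands for v in unique_vars]
        distance = js_distance(probabilities, empirical)
        if distance <= threshold:
            return samples, distance
        num_demands *= 2  # more samples bring the empirical dist closer to the target
    raise RuntimeError("threshold not met within max_doublings")


samples, distance = sample_until_close([1, 2, 3], [0.2, 0.5, 0.3],
                                       num_demands=10, threshold=0.05)
```

A tighter threshold therefore trades a closer match for a larger (and slower to generate) sample.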

trafpy.generator.src.dists.val_dists.gen_skew_data(location, skew, scale, min_val, max_val, num_skew_samples, xlim=None, logscale=False, transparent=True, rand_var_name='Unknown', num_bins=0, round_to_nearest=None, num_decimal_places=2)[source]

Generates and plots skewed data for interactive multimodal distributions.

Parameters
  • location (int/float) – Position value of skewed distribution (mean shape parameter).

  • skew (int/float) – Skew value of skewed distribution (skewness shape parameter).

  • scale (int/float) – Scale value of skewed distribution (standard deviation shape parameter).

  • num_skew_samples (int) – Number of random variables to sample from distribution to generate skew data and plot.

  • xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • transparent (bool) – Whether or not to make plot bins slightly transparent.

  • rand_var_name (str) – Name of random variable to label plot’s x-axis.

  • num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.

  • round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.

  • num_decimal_places (int) – Number of decimal places to round random variable values to. Must be stated explicitly, otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.

Returns

Random variable values sampled from distribution.

Return type

list

trafpy.generator.src.dists.val_dists.gen_skew_dists(min_val, max_val, num_modes=2, xlim=None, rand_var_name='Unknown', round_to_nearest=None, num_decimal_places=2)[source]
trafpy.generator.src.dists.val_dists.gen_skewnorm_data(a, loc, scale, num_samples, min_val=None, max_val=None, round_to_nearest=None, num_decimal_places=2, interactive_params=None, logscale=False, transparent=True)[source]

Generates skew data.

Parameters
  • a (int/float) – Skewness shape parameter. When a=0, distribution is identical to a normal distribution.

  • loc (int/float) – Position value of skewed distribution (mean shape parameter).

  • scale (int/float) – Scale value of skewed distribution (standard deviation shape parameter).

  • min_val (int/float) – Minimum random variable value.

  • max_val (int/float) – Maximum random variable value.

  • num_samples (int) – Number of values to sample from generated distribution to generate skew data.

Returns

List of random variable values sampled from skewed distribution.

Return type

list
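
The effect of the skewness parameter a can be illustrated with a short standard-library sketch. It is not TrafPy's code: the helper name sample_skewnorm_sketch and the seed argument are hypothetical, and the sketch assumes that samples falling outside [min_val, max_val] are rejected and redrawn. It uses the standard construction z = δ|u₀| + √(1 − δ²)·v with δ = a/√(1 + a²), where u₀ and v are independent standard normal draws.

```python
import math
import random


def sample_skewnorm_sketch(a, loc, scale, num_samples,
                           min_val=None, max_val=None, seed=0):
    """Hypothetical helper (not part of TrafPy): skew-normal samples."""
    rng = random.Random(seed)
    delta = a / math.sqrt(1 + a * a)
    samples = []
    while len(samples) < num_samples:
        u0 = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        # Standard skew-normal variate; reduces to a plain normal when a == 0.
        z = delta * abs(u0) + math.sqrt(1 - delta * delta) * v
        x = loc + scale * z
        # Assumption: out-of-range values are rejected and redrawn.
        if min_val is not None and x < min_val:
            continue
        if max_val is not None and x > max_val:
            continue
        samples.append(x)
    return samples


symmetric = sample_skewnorm_sketch(a=0, loc=10, scale=2, num_samples=20000)
skewed = sample_skewnorm_sketch(a=5, loc=10, scale=2, num_samples=20000)
mean_symmetric = sum(symmetric) / len(symmetric)
mean_skewed = sum(skewed) / len(skewed)
```

With a=0 the sample mean sits near loc, while a positive a shifts the mean above loc by roughly scale·δ·√(2/π).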

trafpy.generator.src.dists.val_dists.gen_skewnorm_val_dist(location, skew, scale, num_skew_samples=150000, min_val=None, max_val=None, return_data=False, xlim=None, plot_fig=False, show_fig=False, logscale=False, path_to_save=None, transparent=True, rand_var_name='Random Variable', num_bins=0, round_to_nearest=None, occurrence_multiplier=10, prob_rand_var_less_than=None, num_decimal_places=2)[source]

Generates a skew norm distribution of random variable values.

Parameters
  • location (int/float) – Position value of skewed distribution (mean shape parameter).

  • skew (int/float) – Skew value of skewed distribution (skewness shape parameter).

  • scale (int/float) – Scale value of skewed distribution (standard deviation shape parameter).

  • num_skew_samples (int) – Number of random variables to sample from distribution to generate skew data and plot.

  • return_data (bool) – Whether or not to return the random variable values sampled from generated distribution.

  • xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • transparent (bool) – Whether or not to make plot bins slightly transparent.

  • rand_var_name (str) – Name of random variable to label plot’s x-axis.

  • num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.

  • round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.

  • prob_rand_var_less_than (list) – List of threshold values. For each threshold, gives the probability that a variable sampled randomly from the generated distribution will be less than that threshold. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will give the probability that a randomly chosen variable is less than 3.7 and less than 5.8 respectively.

  • occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.

  • num_decimal_places (int) – Number of decimal places to round random variable values to. Must be stated explicitly, otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.

Returns

Tuple containing:
  • prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.

  • rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.

  • fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple

trafpy.generator.src.dists.val_dists.gen_uniform_val_dist(min_val, max_val, round_to_nearest=None, num_decimal_places=2, occurrence_multiplier=100, path_to_save=None, plot_fig=False, show_fig=False, return_data=False, xlim=None, logscale=False, rand_var_name='Random Variable', prob_rand_var_less_than=None, num_bins=0, print_data=False)[source]

Generates a uniform distribution of random variable values.

Uniform distributions are the simplest distribution. Each random variable value in a uniform distribution has an equal probability of occurring.

Parameters
  • min_val (int/float) – Minimum random variable value.

  • max_val (int/float) – Maximum random variable value.

  • round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.

  • num_decimal_places (int) – Number of decimal places to round random variable values to. Must be stated explicitly, otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.

  • occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.

  • path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.

  • plot_fig (bool) – Whether or not to plot fig. If True, will return fig.

  • show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.

  • return_data (bool) – Whether or not to return the random variable values sampled from generated distribution.

  • xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • rand_var_name (str) – Name of random variable to label plot’s x-axis.

  • prob_rand_var_less_than (list) – List of threshold values. For each threshold, gives the probability that a variable sampled randomly from the generated distribution will be less than that threshold. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will give the probability that a randomly chosen variable is less than 3.7 and less than 5.8 respectively.

  • num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.

  • print_data (bool) – Whether or not to print extra information about the generated data.

Returns

Tuple containing:
  • prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.

  • rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.

  • fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.

Return type

tuple
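
The shape of the returned prob_dist, and what prob_rand_var_less_than reports, can be sketched in plain Python. This is an illustration only: uniform_prob_dist_sketch and prob_less_than are hypothetical helpers, and the sketch assumes the uniform distribution is discretised onto multiples of round_to_nearest between min_val and max_val.

```python
def uniform_prob_dist_sketch(min_val, max_val, round_to_nearest, num_decimal_places=2):
    """Hypothetical helper: discretised uniform dist as a value -> probability dict."""
    num_steps = int(round((max_val - min_val) / round_to_nearest))
    values = [round(min_val + i * round_to_nearest, num_decimal_places)
              for i in range(num_steps + 1)]
    prob = 1 / len(values)  # every value is equally likely
    return {v: prob for v in values}


def prob_less_than(prob_dist, threshold):
    """Probability that a value drawn from prob_dist is strictly below threshold."""
    return sum(p for v, p in prob_dist.items() if v < threshold)


prob_dist = uniform_prob_dist_sketch(min_val=1, max_val=10, round_to_nearest=1)
for threshold in [3.7, 5.8]:
    print(threshold, prob_less_than(prob_dist, threshold))
```

Here the ten values 1 to 10 each carry probability 0.1, so the two thresholds cover three and five values respectively.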

trafpy.generator.src.dists.val_dists.gen_val_dist_data(val_dist, min_val, max_val, num_vals_to_gen, path_to_save=None)[source]

Generates values between min_val and max_val following the val_dist distribution.

trafpy.generator.src.dists.val_dists.gen_weibull_dist(_alpha, _lambda, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]

Generates a Weibull distribution of random variable values.

Weibull distributions often fit scenarios whose random variable values (e.g. ‘time until failure’) are modelled by ‘extreme value theory’ (EVT) in that the values being predicted are more extreme than any previously recorded and, similar to the log-normal distribution, have a low mean but high variance and therefore a long tail/positive skew. Often used to predict time until failure.

E.g. real-world scenarios: Particle sizes generated by grinding, milling & crushing operations, survival times after cancer diagnosis, light bulb failure times, divorce rates, data centre arrival times, etc.

Parameters
  • _alpha (int/float) –

    Shape parameter. Describes slope of distribution.

    _alpha < 1: Probability of random variable occurring decreases as values get higher. Occurs in systems with high ‘infant mortality’ in that e.g. defective items occur soon after t=0 and are therefore weeded out of the population early on.

    _alpha == 1: Special case of the Weibull distribution which reduces the distribution to an exponential distribution.

    _alpha > 1: Probability of random variable value occurring increases with time (until peak passes). Occurs in systems with an ‘aging’ process whereby e.g. components are more likely to fail as time goes on.

  • _lambda (int/float) – Weibull scale parameter. Use to shape distribution standard deviation.

  • size (int) – Number of random variable values to sample from distribution.

  • interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).

  • logscale (bool) – Whether or not plot should have logscale x-axis and bins.

  • transparent (bool) – Whether or not to make plot bins slightly transparent.

Returns

random variable values generated by sampling from the distribution.

Return type

list
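
The shape and scale parameters described above can be explored with the standard library. The sketch below is not TrafPy's implementation, and the helper name sample_weibull_sketch and its seed argument are hypothetical. Note that random.weibullvariate(alpha, beta) takes the scale first and the shape second, so its argument names do not line up with _alpha (shape) and _lambda (scale) as used here.

```python
import math
import random


def sample_weibull_sketch(shape, scale, size, seed=0):
    """Hypothetical helper (not part of TrafPy): Weibull samples."""
    rng = random.Random(seed)
    # The stdlib signature is weibullvariate(scale, shape).
    return [rng.weibullvariate(scale, shape) for _ in range(size)]


def weibull_mean(shape, scale):
    """Theoretical mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1 + 1 / shape)


# shape == 1 reduces to an exponential distribution with mean equal to scale.
exponential_like = sample_weibull_sketch(shape=1, scale=3, size=20000)
# shape > 1 models an 'aging' process with a peak away from zero.
aging = sample_weibull_sketch(shape=3, scale=3, size=20000)

mean_exponential_like = sum(exponential_like) / len(exponential_like)
mean_aging = sum(aging) / len(aging)
```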

trafpy.generator.src.dists.val_dists.x_round(x, round_to_nearest=1, num_decimal_places=2, print_data=False, min_val=None)[source]

Rounds variable to nearest specified value.
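
One plausible reading of this rounding, and of why num_decimal_places matters, is sketched below. The helper round_to_nearest_sketch is hypothetical and does not model the print_data or min_val arguments, whose exact behaviour is not described here.

```python
def round_to_nearest_sketch(x, round_to_nearest=1, num_decimal_places=2):
    """Hypothetical helper: snap x to the nearest multiple of round_to_nearest."""
    snapped = round(x / round_to_nearest) * round_to_nearest
    # Without this final round, 3 * 0.2 would be stored as 0.6000000000000001
    # and be treated as a different unique value from 0.6.
    return round(snapped, num_decimal_places)


print(3 * 0.2 == 0.6)                      # floating point noise makes this False
print(round_to_nearest_sketch(0.61, 0.2))  # snapped to a clean multiple of 0.2
print(round_to_nearest_sketch(7.3, 5))     # snapped to a multiple of 5
```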

Module contents