trafpy.generator.src.dists package
Submodules
trafpy.generator.src.dists.node_dists module
Module for generating node distributions.
- trafpy.generator.src.dists.node_dists.adjust_node_dist_for_rack_prob_config(rack_prob_config, eps, node_dist, print_data=False)[source]
Unlike adjust_node_dist_from_multinomial_exp_for_rack_prob_config, this function does not use a multinomial experiment to adjust the prob dist. Instead, it deterministically distorts the probabilities of the original node distribution so that the required inter-/intra-rack probabilities are met.
- trafpy.generator.src.dists.node_dists.adjust_node_dist_from_multinomial_exp_for_rack_prob_config(rack_prob_config, eps, node_dist, num_exps_factor=2, print_data=False)[source]
Unlike adjust_node_dist_for_rack_prob_config, this function adjusts the node dist by running multinomial experiments on the initial node distribution to sample from it. It therefore takes much longer than the other function, especially for networks with >1,000 nodes.
Takes node dist and uses it to generate new node dist given inter- and intra-rack configuration.
Different DCNs have different inter and intra rack traffic. This function allows you to specify how much of your traffic should be inter and intra rack.
- Parameters
rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or 'racks'. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, will assume that server-server src-dst requests are independent of which rack they might be in. If specified, dict should have a 'racks_dict' key, whose value is a dict with keys as rack labels (e.g. 'rack_0', 'rack_1' etc.) and whose value for each key is a list of the endpoints in the respective rack (e.g. ['server_0', 'server_24', 'server_56', …]), and a 'prob_inter_rack' key whose value is a float (e.g. 0.9), setting the probability that a chosen src endpoint has a destination which is outside of its rack. If you want to e.g. designate an entire rack as a 'hot rack' (many traffic requests occur from this rack), would specify skewed_nodes to contain the list of servers in this rack and configure rack_prob_config appropriately.
eps (list) – List of network node endpoints that can act as sources & destinations.
node_dist (numpy array) – 2D matrix array of source-destination pair probabilities of being chosen.
num_exps_factor (int) – Factor by which to multiply number of ep pairs to get the number of multinomial experiments to run when generating new node dist.
print_data (bool) – Whether or not to print extra information about the generated data.
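The dictionary layout described for rack_prob_config can be sketched as follows. The endpoint and rack labels are made up for illustration, and inter_rack_fraction is a hypothetical helper (not part of TrafPy) that measures how much probability mass of a pair distribution crosses rack boundaries. The pair distribution is assumed to be a dict keyed by (src, dst) tuples.

```python
# Illustrative only: labels and the helper below are assumptions, not TrafPy API.
eps = ["server_0", "server_1", "server_2", "server_3"]

rack_prob_config = {
    "racks_dict": {
        "rack_0": ["server_0", "server_1"],
        "rack_1": ["server_2", "server_3"],
    },
    "prob_inter_rack": 0.9,
}


def inter_rack_fraction(pair_probs, racks_dict):
    """Return the share of probability mass on pairs whose endpoints are in different racks."""
    ep_to_rack = {ep: rack for rack, members in racks_dict.items() for ep in members}
    total = sum(pair_probs.values())
    inter = sum(
        prob
        for (src, dst), prob in pair_probs.items()
        if ep_to_rack[src] != ep_to_rack[dst]
    )
    return inter / total


# Uniform distribution over all ordered src != dst pairs.
pairs = [(src, dst) for src in eps for dst in eps if src != dst]
uniform = {pair: 1 / len(pairs) for pair in pairs}
frac = inter_rack_fraction(uniform, rack_prob_config["racks_dict"])
print(frac)
```

With two racks of two servers each, 8 of the 12 ordered pairs cross racks, so a uniform distribution gives an inter-rack fraction of about 0.67. A prob_inter_rack of 0.9 would therefore require shifting mass from intra-rack to inter-rack pairs.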
- trafpy.generator.src.dists.node_dists.adjust_probability_array_sum(probs, target_sum=1, print_data=False)[source]
Adjusts the probabilities in an array so that they sum to target_sum.
- trafpy.generator.src.dists.node_dists.adjust_probability_dict_sum(probs, target_sum=1, print_data=False)[source]
Adjusts the probabilities in a dict so that they sum to target_sum.
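One common way to make a set of probabilities sum to a target is to rescale them proportionally. The sketch below shows that idea for a list and for a dict. It is an assumption about the general technique, not a description of how the two functions above are implemented, and rescale_list and rescale_dict are hypothetical names.

```python
def rescale_list(probs, target_sum=1.0):
    """Scale every entry by the same factor so the entries sum to target_sum."""
    total = sum(probs)
    return [p * target_sum / total for p in probs]


def rescale_dict(probs, target_sum=1.0):
    """Same proportional rescaling, keeping the dict keys unchanged."""
    total = sum(probs.values())
    return {key: p * target_sum / total for key, p in probs.items()}


fixed_list = rescale_list([0.2, 0.2, 0.5])
fixed_dict = rescale_dict({"a": 0.2, "b": 0.2, "c": 0.5})
print(sum(fixed_list), sum(fixed_dict.values()))
```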
- trafpy.generator.src.dists.node_dists.assign_matrix_to_probs(eps, node_dist)[source]
Assigns probabilities in 2D matrix to a src-dst pair prob dist dict.
- trafpy.generator.src.dists.node_dists.assign_probs_to_matrix(eps, probs, matrix=None)[source]
Assigns probabilities to 2D matrix.
probs can be list of pair probabilities or dict of key-value pair-probability
N.B. if probs is list, assumes probs are given in order of matrix indices when looping for src in eps for dst in eps
- trafpy.generator.src.dists.node_dists.convert_pair_prob_dist_dict_to_matrix_pair_prob_dist_dict(pair_prob_dist, eps)[source]
- Parameters
pair_prob_dist (dict) – Dict whose keys are node pairs and whose values are probabilities or fractions.
- Returns
- Dict whose keys are matrix indices of the node pairs and whose values are the pairs’ corresponding probabilities or fractions.
- Return type
matrix_pair_prob_dist (dict)
- trafpy.generator.src.dists.node_dists.convert_sampled_pairs_into_node_dist(sampled_pairs, eps)[source]
- trafpy.generator.src.dists.node_dists.gen_demand_nodes(eps, node_dist, size, axis, path_to_save=None, check_sum_valid=True)[source]
Generates demand nodes following the node_dist distribution
- Parameters
eps (list) – List of node endpoint labels.
node_dist (numpy array) – Probability distribution from which each node is chosen
size (int) – Number of demand nodes to generate
axis (int, 1 or 0) – Which axis of normalised node distribution to consider. E.g. If generating src nodes, axis=0. If dst nodes, axis=1
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
check_sum_valid (bool) – Whether or not to ensure node dist sums to 1. If efficiency matters, set to False.
- trafpy.generator.src.dists.node_dists.gen_multimodal_node_dist(eps, skewed_nodes=[], skewed_node_probs=[], num_skewed_nodes=None, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]
Generates a multimodal node distribution.
Generates a multimodal node demand distribution i.e. certain nodes have a certain specified probability of being chosen. If no skewed nodes are given, randomly selects a random number of node(s) to skew. If no skewed node probabilities are given, randomly selects the probability with which to skew each node between 0.5 and 0.8. If no number of skewed nodes is given, randomly chooses the number of nodes to skew.
- Parameters
eps (list) – List of network node endpoints that can act as sources & destinations
skewed_nodes (list) – Node(s) whose probability of being chosen you want to skew/specify
skewed_node_probs (list) – Probability (or probabilities) of the skewed node(s) being chosen
num_skewed_nodes (int) – Number of nodes to skew. If None, will gen a number between 10% and 30% of the total number of nodes in network
rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or 'racks'. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, will assume that server-server src-dst requests are independent of which rack they might be in. If specified, dict should have a 'racks_dict' key, whose value is a dict with keys as rack labels (e.g. 'rack_0', 'rack_1' etc.) and whose value for each key is a list of the endpoints in the respective rack (e.g. ['server_0', 'server_24', 'server_56', …]), and a 'prob_inter_rack' key whose value is a float (e.g. 0.9), setting the probability that a chosen src endpoint has a destination which is outside of its rack. If you want to e.g. designate an entire rack as a 'hot rack' (many traffic requests occur from this rack), would specify skewed_nodes to contain the list of servers in this rack and configure rack_prob_config appropriately.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
print_data (bool) – Whether or not to print extra information about the generated data.
- Returns
- Tuple containing:
node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.
fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
- trafpy.generator.src.dists.node_dists.gen_multimodal_node_pair_dist(eps, skewed_pairs=[], skewed_pair_probs=[], num_skewed_pairs=None, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]
Generates a multimodal node pair distribution.
Generates a multimodal node pair demand distribution i.e. certain node pairs have a certain specified probability of being chosen. If no skewed pairs are given, randomly selects pairs to skew. If no skewed pair probabilities are given, randomly selects the probability with which to skew each pair between 0.1 and 0.3. If no number of skewed pairs is given, randomly chooses the number of pairs to skew.
- Parameters
eps (list) – List of network node endpoints that can act as sources & destinations.
skewed_pairs (list of lists) – List of the node pairs [src,dst] to skew.
skewed_pair_probs (list) – Probabilities of node pairs being chosen.
num_skewed_pairs (int) – Number of pairs to randomly skew.
rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or 'racks'. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, will assume that server-server src-dst requests are independent of which rack they might be in. If specified, dict should have a 'racks_dict' key, whose value is a dict with keys as rack labels (e.g. 'rack_0', 'rack_1' etc.) and whose value for each key is a list of the endpoints in the respective rack (e.g. ['server_0', 'server_24', 'server_56', …]), and a 'prob_inter_rack' key whose value is a float (e.g. 0.9), setting the probability that a chosen src endpoint has a destination which is outside of its rack. If you want to e.g. designate an entire rack as a 'hot rack' (many traffic requests occur from this rack), would specify skewed_nodes to contain the list of servers in this rack and configure rack_prob_config appropriately.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
print_data (bool) – Whether or not to print extra information about the generated data.
- Returns
- Tuple containing:
node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.
fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
- trafpy.generator.src.dists.node_dists.gen_node_demands(eps, node_dist, num_demands, rack_prob_config=None, duplicate=False, path_to_save=None)[source]
Uses node distribution to generate src-dst node pair demands.
- Parameters
eps (list) – List of network node endpoints that can act as sources & destinations.
node_dist (numpy array) – 2D matrix array of source-destination pair probabilities of being chosen.
num_demands (int) – Number of src-dst node pairs to generate.
duplicate (bool) – Whether or not to duplicate src-dst node pairs. Use this if the demands you’re generating have a ‘take down’ event as well as an ‘establish’ event.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
- Returns
- Tuple containing:
sn (numpy array): Selected source nodes.
dn (numpy array): Selected destination nodes.
- Return type
tuple
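As a rough illustration of drawing src-dst pairs from a 2D matrix of pair probabilities, the sketch below uses only the standard library. sample_pairs is a hypothetical helper, the matrix is a plain list of lists rather than a numpy array, and rows are assumed to index sources; none of this is taken from the TrafPy implementation.

```python
import random


def sample_pairs(eps, node_dist, num_demands, seed=0):
    """Draw src-dst pairs with probability proportional to node_dist[i][j]."""
    rng = random.Random(seed)
    n = len(eps)
    pairs = [(eps[i], eps[j]) for i in range(n) for j in range(n)]
    weights = [node_dist[i][j] for i in range(n) for j in range(n)]
    chosen = rng.choices(pairs, weights=weights, k=num_demands)
    sn = [src for src, _ in chosen]  # selected source nodes
    dn = [dst for _, dst in chosen]  # selected destination nodes
    return sn, dn


eps = ["server_0", "server_1", "server_2"]
# Zero diagonal: an endpoint never sends to itself in this toy matrix.
node_dist = [
    [0.0, 0.4, 0.1],
    [0.4, 0.0, 0.0],
    [0.1, 0.0, 0.0],
]
sn, dn = sample_pairs(eps, node_dist, num_demands=1000)
print(sn[:3], dn[:3])
```

Pairs with zero probability, such as server_1 to server_2 here, are never drawn.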
- trafpy.generator.src.dists.node_dists.gen_uniform_multinomial_exp_node_dist(eps, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]
Runs multinomial exp with uniform initial probability to generate slight skew.
Runs a multinomial experiment where each node pair has same (uniform) probability of being chosen. Will generate a node demand distribution where a few pairs & nodes have a slight skew in demand
- Parameters
eps (list) – List of network node endpoints that can act as sources & destinations
rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or 'racks'. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, will assume that server-server src-dst requests are independent of which rack they might be in. If specified, dict should have a 'racks_dict' key, whose value is a dict with keys as rack labels (e.g. 'rack_0', 'rack_1' etc.) and whose value for each key is a list of the endpoints in the respective rack (e.g. ['server_0', 'server_24', 'server_56', …]), and a 'prob_inter_rack' key whose value is a float (e.g. 0.9), setting the probability that a chosen src endpoint has a destination which is outside of its rack. If you want to e.g. designate an entire rack as a 'hot rack' (many traffic requests occur from this rack), would specify skewed_nodes to contain the list of servers in this rack and configure rack_prob_config appropriately.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
print_data (bool) – Whether or not to print extra information about the generated data.
- Returns
- Tuple containing:
node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.
fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
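The idea of a multinomial experiment with uniform initial probabilities can be sketched with the standard library: every pair is equally likely on each trial, but a finite number of trials leaves some pairs with slightly more counts than others. The helper name, the number of trials, and the choice to exclude src == dst pairs are assumptions made for illustration.

```python
import random
from collections import Counter


def uniform_multinomial_pair_dist(eps, num_trials, seed=0):
    """Sample pairs uniformly, then normalise the counts into probabilities."""
    rng = random.Random(seed)
    pairs = [(src, dst) for src in eps for dst in eps if src != dst]
    counts = Counter(rng.choice(pairs) for _ in range(num_trials))
    return {pair: counts[pair] / num_trials for pair in pairs}


eps = [f"server_{i}" for i in range(4)]
dist = uniform_multinomial_pair_dist(eps, num_trials=240)
print(min(dist.values()), max(dist.values()))
```

The fewer trials are run relative to the number of pairs, the larger the resulting skew tends to be.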
- trafpy.generator.src.dists.node_dists.gen_uniform_node_dist(eps, rack_prob_config=None, path_to_save=None, plot_fig=False, show_fig=False, print_data=False)[source]
Generates a uniform node distribution.
- Parameters
eps (list) – List of network node endpoints that can act as sources & destinations
rack_prob_config (dict) – Network endpoints/servers are often grouped into physically local clusters or 'racks'. Different networks may have different levels of inter- (between) and intra- (within) rack communication. If rack_prob_config is left as None, will assume that server-server src-dst requests are independent of which rack they might be in. If specified, dict should have a 'racks_dict' key, whose value is a dict with keys as rack labels (e.g. 'rack_0', 'rack_1' etc.) and whose value for each key is a list of the endpoints in the respective rack (e.g. ['server_0', 'server_24', 'server_56', …]), and a 'prob_inter_rack' key whose value is a float (e.g. 0.9), setting the probability that a chosen src endpoint has a destination which is outside of its rack. If you want to e.g. designate an entire rack as a 'hot rack' (many traffic requests occur from this rack), would specify skewed_nodes to contain the list of servers in this rack and configure rack_prob_config appropriately.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
print_data (bool) – Whether or not to print extra information about the generated data.
- Returns
- Tuple containing:
node_dist (numpy array): 2D matrix array of source-destination pair probabilities of being chosen.
fig (matplotlib.figure.Figure, optional): Node distributions plotted as a 2D matrix. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
- trafpy.generator.src.dists.node_dists.get_inter_intra_rack_pair_prob_dicts(pair_prob_dict, ep_to_rack_dict)[source]
- trafpy.generator.src.dists.node_dists.get_network_pair_mapper(eps)[source]
Gets dicts mapping network endpoint indices to and from node dist matrix.
- trafpy.generator.src.dists.node_dists.get_pair_prob_dict_of_node_dist_matrix(node_dist, eps, all_combinations=False, bidirectional=False)[source]
Gets prob dict of each pair being chosen given node dist of probabilities.
If all_combinations, will record pair probabilities for all possible pair combinations i.e. src-dst and dst-src. If False, assumes src-dst==dst-src.
If bidirectional, will multiply probabilities by 2 as pair can be src-dst or dst-src. If bidirectional=True -> values sum to 1, if bidirectional=False -> values sum to 0.5.
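The sums of 1 and 0.5 follow from simple arithmetic on a symmetric matrix with a zero diagonal: the upper triangle holds exactly half of the total probability mass. The sketch below demonstrates this with a toy matrix. The helper name and the use of (src, dst) tuples as dict keys are assumptions, not the library's actual output format.

```python
eps = ["a", "b", "c"]
# Symmetric matrix with a zero diagonal whose entries sum to 1.
node_dist = [
    [0.0, 0.3, 0.1],
    [0.3, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]


def upper_triangle_pair_probs(node_dist, eps, bidirectional=False):
    """Record each unordered pair once, optionally doubling its probability."""
    factor = 2 if bidirectional else 1
    return {
        (eps[i], eps[j]): factor * node_dist[i][j]
        for i in range(len(eps))
        for j in range(i + 1, len(eps))
    }


one_way = upper_triangle_pair_probs(node_dist, eps)
two_way = upper_triangle_pair_probs(node_dist, eps, bidirectional=True)
print(sum(one_way.values()), sum(two_way.values()))
```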
trafpy.generator.src.dists.plot_dists module
trafpy.generator.src.dists.val_dists module
Module for generating value distributions.
- trafpy.generator.src.dists.val_dists.combine_multiple_mode_dists(data_dict, min_val, max_val, xlim=None, rand_var_name='Unknown', round_to_nearest=None, num_decimal_places=2)[source]
- trafpy.generator.src.dists.val_dists.combine_skews(data_dict, min_val, max_val, bg_factor=0.5, xlim=None, logscale=False, transparent=True, rand_var_name='Unknown', num_bins=0, round_to_nearest=None, num_decimal_places=2)[source]
Combines multiple probability distributions for multimodal plotting.
- Parameters
data_dict (dict) – Keys are mode iterations, values are random variable values for the mode iteration.
min_val (int/float) – Minimum random variable value.
max_val (int/float) – Maximum random variable value.
bg_factor (int/float) – Factor used to determine amount of noise to add amongst shaped modes being combined. Higher factor will add more noise to distribution and make modes more connected, lower will reduce noise but make modes less connected.
xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
rand_var_name (str) – Name of random variable to label plot’s x-axis.
num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.
round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.
num_decimal_places (int) – Number of decimal places to round random variable values to. Need to explicitly state this, otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.
- Returns
Probability distribution whose key-value pairs are random variable value-probability pairs.
- Return type
dict
- trafpy.generator.src.dists.val_dists.convert_data_to_key_occurrences(data)[source]
Converts random variable data into value keys and corresponding occurrences.
- Parameters
data (list) – Random variables to convert into key-num_occurrences pairs.
- Returns
Random variable value - number of occurrences key-value pairs generated from random variable data.
- Return type
dict
- trafpy.generator.src.dists.val_dists.convert_key_occurrences_to_data(keys, num_occurrences)[source]
Converts value keys and their number of occurrences into random vars.
- Parameters
keys (list) – Random variable values.
num_occurrences (list) – Number of each random variable to generate.
- Returns
Random variables generated.
- Return type
list
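The two conversions above are inverses of each other, which a short standard-library sketch can show. It uses collections.Counter rather than the TrafPy functions themselves, and the ordering of the expanded list is an assumption.

```python
from collections import Counter

data = [1.5, 2.0, 1.5, 3.0, 1.5]

# Data -> value keys and their number of occurrences.
key_occurrences = dict(Counter(data))
print(key_occurrences)  # {1.5: 3, 2.0: 1, 3.0: 1}

# Keys and occurrences -> data (grouped by key rather than in original order).
keys = list(key_occurrences.keys())
num_occurrences = list(key_occurrences.values())
rebuilt = [key for key, n in zip(keys, num_occurrences) for _ in range(n)]
print(rebuilt)  # [1.5, 1.5, 1.5, 2.0, 3.0]
```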
- trafpy.generator.src.dists.val_dists.gen_discrete_prob_dist(rand_vars, unique_vars=None, round_to_nearest=None, num_decimal_places=2, path_to_save=None)[source]
Generate discrete probability distribution from list of random variables.
Takes rand var values, rounds to nearest value (specified as arg, defaults by not rounding at all) to discretise the data, and generates a probability distribution for the data
- Parameters
rand_vars (list) – Random variable values
unique_vars (list) – List of unique random variables that can occur. If given, will init each as having occurred 0 times and count number of times each occurred. If left as None, will only record probabilities of random variables that actually occurred in rand_vars.
round_to_nearest (int/float) – Value to round rand vars to nearest when discretising rand var values. E.g. if round_to_nearest=0.2, will round each rand var to nearest 0.2
num_decimal_places (int) – Number of decimal places for discretised rand vars. Need to explicitly state this because otherwise Python’s floating point arithmetic will cause spurious unique random var values
- Returns
- Tuple containing:
xk (list): List of (discretised) unique random variable values that occurred
pmf (list): List of corresponding probabilities that each unique value in xk occurs
- Return type
tuple
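The following sketch illustrates the discretisation step and why num_decimal_places matters. In Python, 3 * 0.2 evaluates to 0.6000000000000001, so without a final rounding step the value 0.6 could appear as a spurious extra unique value. The helper discretise is hypothetical and is not the TrafPy implementation.

```python
from collections import Counter


def discretise(rand_vars, round_to_nearest=None, num_decimal_places=2):
    """Round values to the nearest multiple, then build a probability mass function."""
    if round_to_nearest is not None:
        rand_vars = [
            round(round(x / round_to_nearest) * round_to_nearest, num_decimal_places)
            for x in rand_vars
        ]
    counts = Counter(rand_vars)
    xk = sorted(counts)  # unique (discretised) values
    pmf = [counts[x] / len(rand_vars) for x in xk]  # matching probabilities
    return xk, pmf


xk, pmf = discretise([0.11, 0.19, 0.42, 0.38, 0.61], round_to_nearest=0.2)
print(xk)   # [0.2, 0.4, 0.6]
print(pmf)  # [0.4, 0.4, 0.2]
```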
- trafpy.generator.src.dists.val_dists.gen_exponential_dist(_beta, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]
Generates an exponential distribution of random variable values.
The exponential distribution often fits scenarios whose events’ random variable values (e.g. ‘time between events’) are made of many small values (e.g. time intervals) and a few large values. Often used to predict time until next event occurs.
E.g. Real-world scenarios: Time between earthquakes, car accidents, mail delivery, and data centre demand arrival.
- Parameters
_beta (int/float) – Mean random variable value.
size (int) – Number of random variable values to sample from distribution.
interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
transparent (bool) – Whether or not to make plot bins slightly transparent.
- Returns
Random variable values generated by sampling from the distribution.
- Return type
list
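A minimal standard-library sketch of sampling from an exponential distribution parameterised by its mean, as _beta is described above. Note that random.expovariate takes the rate (1 / mean), not the mean. sample_exponential is a hypothetical helper and does not reproduce the rounding, clipping, or plotting options of gen_exponential_dist.

```python
import random


def sample_exponential(beta, size, seed=0):
    """Draw `size` values from an exponential distribution with mean `beta`."""
    rng = random.Random(seed)
    # expovariate expects lambda = 1 / mean.
    return [rng.expovariate(1.0 / beta) for _ in range(size)]


samples = sample_exponential(beta=2.0, size=50_000)
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # close to 2.0
```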
- trafpy.generator.src.dists.val_dists.gen_lognormal_dist(_mu, _sigma, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]
Generates a log-normal distribution of random variable values.
Log-normal distributions often fit scenarios whose random variable values have a low mean value but a high degree of variance, leading to a distribution that is positively skewed (i.e. has a long tail to the right of its peak).
The log-normal distribution is mathematically similar to the normal distribution, since its random variable is normally distributed when its logarithm is taken. I.e. for a log-normally distributed random variable X, Y=ln(X) would have a normal distribution.
E.g. of real-world scenarios: Length of a chess game, number of hospitalisations during an epidemic, the time after which a mechanical system needs repair, data centre demand interarrival times, etc.
- Parameters
_mu (int/float) – Mean value of underlying normal distribution.
_sigma (int/float) – Standard deviation of underlying normal distribution.
size (int) – Number of random variable values to sample from distribution.
interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
transparent (bool) – Whether or not to make plot bins slightly transparent.
- Returns
Random variable values generated by sampling from the distribution.
- Return type
list
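The relationship Y=ln(X) can be checked numerically with the standard library: taking the logarithm of log-normal samples should recover the mean and standard deviation of the underlying normal distribution. The parameter values below are arbitrary, and the sketch uses random.lognormvariate rather than gen_lognormal_dist.

```python
import math
import random

rng = random.Random(0)
mu, sigma = 1.0, 0.5  # parameters of the underlying normal distribution

x = [rng.lognormvariate(mu, sigma) for _ in range(50_000)]
y = [math.log(value) for value in x]  # Y = ln(X)

y_mean = sum(y) / len(y)
y_std = math.sqrt(sum((value - y_mean) ** 2 for value in y) / len(y))
print(y_mean, y_std)  # close to mu and sigma respectively
```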
- trafpy.generator.src.dists.val_dists.gen_multimodal_val_dist(min_val, max_val, locations=[], skews=[], scales=[], num_skew_samples=[], bg_factor=0.5, round_to_nearest=None, num_decimal_places=2, occurrence_multiplier=10, path_to_save=None, plot_fig=False, show_fig=False, return_data=False, xlim=None, logscale=False, rand_var_name='Random Variable', prob_rand_var_less_than=None, num_bins=0, print_data=False)[source]
Generates a multimodal distribution of random variable values.
Multimodal distributions are arbitrary distributions with >= 2 different modes. A multimodal distribution with 2 modes is a special case called a ‘bimodal distribution’. Bimodal distributions are the most common multimodal distribution.
E.g. Real-world scenarios of bimodal distributions: Starting salaries for lawyers, book prices, peak restaurant hours, age groups of disease victims, packet sizes in data centre networks, etc.
- Parameters
min_val (int/float) – Minimum random variable value.
max_val (int/float) – Maximum random variable value.
locations (list) – Position value(s) of skewed distribution(s) (mean shape parameter).
skews (list) – Skew value(s) of skewed distribution(s) (skewness shape parameter).
scales (list) – Scale value(s) of skewed distribution(s) (standard deviation shape parameter).
num_skew_samples (list) – Number(s) of random variables to sample from distribution(s) to generate skew data and plot.
bg_factor (int/float) – Factor used to determine amount of noise to add amongst shaped modes being combined. Higher factor will add more noise to distribution and make modes more connected, lower will reduce noise but make modes less connected.
round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.
num_decimal_places (int) – Number of decimal places to round random variable values to. Need to explicitly state this, otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.
occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
return_data (bool) – Whether or not to return random variable data sampled from generated distribution.
xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
rand_var_name (str) – Name of random variable to label plot’s x-axis.
prob_rand_var_less_than (list) – List of values for which to print the probability that a variable sampled randomly from the generated distribution will be less than. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will return the probability that a randomly chosen variable is less than 3.7 and 5.8 respectively.
num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.
print_data (bool) – Whether or not to print extra information about the generated data.
- Returns
- Tuple containing:
prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.
rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.
fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
- trafpy.generator.src.dists.val_dists.gen_named_val_dist(dist, params=None, interactive_plot=False, size=150000, occurrence_multiplier=100, return_data=False, round_to_nearest=None, num_decimal_places=2, path_to_save=None, plot_fig=False, show_fig=False, min_val=None, max_val=None, xlim=None, logscale=False, rand_var_name='Random Variable', prob_rand_var_less_than=None, num_bins=0, print_data=False)[source]
Generates a ‘named’ (e.g. Weibull/exponential/log-normal/Pareto) distribution.
- Parameters
dist (str) – One of the valid named distributions (e.g. ‘weibull’, ‘lognormal’, ‘pareto’, ‘exponential’)
params (dict) – Corresponding parameter arguments of distribution (e.g. for Weibull, params={‘_alpha’: 1.4, ‘_lambda’: 7000}). See individual name distribution function generators for more information.
interactive_plot (bool) – Whether or not you want to use the interactive functionality of this function in Jupyter notebook to visually shape your named distribution.
size (int) – Number of values to sample from generated distribution when generating random variable data.
round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.
num_decimal_places (int) – Number of decimal places to round random variable values to. Need to explicitly state this, otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.
occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
return_data (bool) – Whether or not to return random variable data sampled from generated distribution.
min_val (int/float) – Minimum random variable value.
max_val (int/float) – Maximum random variable value.
xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
rand_var_name (str) – Name of random variable to label plot’s x-axis.
prob_rand_var_less_than (list) – List of values for which to print the probability that a variable sampled randomly from the generated distribution will be less than. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will return the probability that a randomly chosen variable is less than 3.7 and 5.8 respectively.
num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.
print_data (bool) – Whether or not to print extra information about the generated data.
- Returns
- Tuple containing:
prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.
rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.
fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
- trafpy.generator.src.dists.val_dists.gen_normal_dist(loc, scale, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]
Generates a normal/Gaussian distribution of random variable values.
- Parameters
size (int) – Number of random variable values to sample from distribution.
interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
transparent (bool) – Whether or not to make plot bins slightly transparent.
- Returns
Random variable values generated by sampling from the distribution.
- Return type
list
- trafpy.generator.src.dists.val_dists.gen_pareto_dist(_alpha, _mode, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]
Generates a Pareto distribution of random variable values.
Pareto distributions often fit scenarios whose random variable values have high probability of having a small range of values, leading to a distribution that is heavily skewed (i.e. has a long tail).
E.g. real-world scenarios: A large portion of society’s wealth being held by a small portion of its population, human settlement sizes, value of oil reserves in oil fields, size of sand particles, male dating success on Tinder, sizes of data centre demands, etc.
- Parameters
_alpha (int/float) – Shape parameter of Pareto distribution. Describes how ‘stretched out’ (i.e. how high variance) the distribution is.
_mode (int/float) – Mode of the distribution, which is also the distribution’s minimum possible value.
size (int) – number of random variable values to sample from distribution.
interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
transparent (bool) – Whether or not to make plot bins slightly transparent.
- Returns
random variable values generated by sampling from the distribution.
- Return type
list
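As a rough illustration of how _alpha and _mode shape the samples, the following is a minimal sketch using only Python's standard library. It is not TrafPy's own implementation; the helper name sample_pareto is hypothetical, and rounding to num_decimal_places is assumed to be applied per sample.

```python
import random

def sample_pareto(alpha, mode, size, num_decimal_places=2, seed=None):
    """Hypothetical helper: draw `size` Pareto samples whose minimum is `mode`."""
    rng = random.Random(seed)
    samples = []
    for _ in range(size):
        # paretovariate(alpha) returns values >= 1, so scaling by `mode`
        # makes `mode` the smallest possible value.
        x = rng.paretovariate(alpha) * mode
        samples.append(round(x, num_decimal_places))
    return samples

samples = sample_pareto(alpha=1.5, mode=2.0, size=10000, seed=0)
mean = sum(samples) / len(samples)
median = sorted(samples)[len(samples) // 2]
# A long tail pulls the mean well above the median.
print(min(samples), median, mean)
```

A smaller alpha gives a heavier tail, so the gap between the mean and the median grows as alpha decreases.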
- trafpy.generator.src.dists.val_dists.gen_rand_vars_from_discretised_dist(unique_vars, probabilities, num_demands, jensen_shannon_distance_threshold=None, show_fig=False, xlabel='Random Variable', font_size=20, figsize=(4, 3), marker_size=15, logscale=False, path_to_save=None)[source]
Generates random variable values by sampling from a discretised distribution.
- Parameters
unique_vars (list) – Possible random variable values.
probabilities (list) – Corresponding probabilities of each random variable value being chosen.
num_demands (int) – Number of random variables to sample.
jensen_shannon_distance_threshold (float) – Maximum Jensen-Shannon distance allowed between the generated random variables and the discretised dist they are generated from. Must be between 0 and 1. A distance of 0 means the distributions are exactly the same; a distance of 1 means they are not at all similar. https://medium.com/datalab-log/measuring-the-statistical-similarity-between-two-samples-using-jensen-shannon-and-kullback-leibler-8d05af514b15 N.B. To meet the threshold, this function will keep doubling num_demands.
show_fig (bool) – Whether or not to plot the sampled random variable distribution against the original distribution.
path_to_save (str) – Path to directory (with file name included) in which to save generated data. E.g. path_to_save=’data/my_data’
- Returns
Random variable values sampled from dist.
- Return type
numpy array
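The threshold behaviour described above can be sketched as follows. This is a minimal standard-library illustration, not TrafPy's code: the names js_distance and sample_until_close are hypothetical, and it assumes that the samples are redrawn from scratch each time num_demands is doubled.

```python
import math
import random
from collections import Counter

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so the result lies in [0, 1]."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

def sample_until_close(unique_vars, probabilities, num_demands, threshold, seed=None):
    """Hypothetical helper: resample, doubling num_demands, until the threshold is met."""
    rng = random.Random(seed)
    while True:
        sampled = rng.choices(unique_vars, weights=probabilities, k=num_demands)
        counts = Counter(sampled)
        empirical = [counts[v] / num_demands for v in unique_vars]
        if js_distance(probabilities, empirical) <= threshold:
            return sampled
        num_demands *= 2  # more samples track the target distribution more closely

sampled = sample_until_close([1, 2, 3], [0.5, 0.3, 0.2], num_demands=10, threshold=0.05, seed=0)
print(len(sampled))
```

Because of the doubling, the number of values returned can be larger than the num_demands originally requested.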
- trafpy.generator.src.dists.val_dists.gen_skew_data(location, skew, scale, min_val, max_val, num_skew_samples, xlim=None, logscale=False, transparent=True, rand_var_name='Unknown', num_bins=0, round_to_nearest=None, num_decimal_places=2)[source]
Generates and plots skewed data for interactive multimodal distributions.
- Parameters
location (int/float) – Position value of skewed distribution (mean shape parameter).
skew (int/float) – Skew value of skewed distribution (skewness shape parameter).
scale (int/float) – Scale value of skewed distribution (standard deviation shape parameter).
num_skew_samples (int) – Number of random variables to sample from distribution to generate skew data and plot.
xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
transparent (bool) – Whether or not to make plot bins slightly transparent.
rand_var_name (str) – Name of random variable to label plot’s x-axis.
num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.
round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.
num_decimal_places (int) – Number of decimal places to round random variable values to. Need to explicitly state otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.
- Returns
Random variable values sampled from distribution.
- Return type
list
- trafpy.generator.src.dists.val_dists.gen_skew_dists(min_val, max_val, num_modes=2, xlim=None, rand_var_name='Unknown', round_to_nearest=None, num_decimal_places=2)[source]
- trafpy.generator.src.dists.val_dists.gen_skewnorm_data(a, loc, scale, num_samples, min_val=None, max_val=None, round_to_nearest=None, num_decimal_places=2, interactive_params=None, logscale=False, transparent=True)[source]
Generates skew data.
- Parameters
a (int/float) – Skewness shape parameter. When a=0, distribution is identical to a normal distribution.
loc (int/float) – Position value of skewed distribution (mean shape parameter).
scale (int/float) – Scale value of skewed distribution (standard deviation shape parameter).
min_val (int/float) – Minimum random variable value.
max_val (int/float) – Maximum random variable value.
num_samples (int) – Number of values to sample from generated distribution to generate skew data.
- Returns
List of random variable values sampled from skewed distribution.
- Return type
list
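To make the role of a, loc and scale concrete, here is a minimal sketch of skew-normal sampling using only the standard library. It is not TrafPy's implementation: the helper name sample_skewnorm is hypothetical, and redrawing values that fall outside [min_val, max_val] is an assumption about how the bounds might be enforced.

```python
import math
import random

def sample_skewnorm(a, loc, scale, num_samples, min_val=None, max_val=None, seed=None):
    """Hypothetical helper: sample a skew-normal distribution via two correlated normals."""
    rng = random.Random(seed)
    delta = a / math.sqrt(1 + a * a)
    samples = []
    while len(samples) < num_samples:
        u0 = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        # u1 is standard normal with correlation delta to u0.
        u1 = delta * u0 + math.sqrt(1 - delta * delta) * v
        # Flipping the sign of u1 when u0 < 0 yields a standard skew-normal variate.
        z = u1 if u0 >= 0 else -u1
        x = loc + scale * z
        # Assumption: out-of-range values are rejected and redrawn.
        if min_val is not None and x < min_val:
            continue
        if max_val is not None and x > max_val:
            continue
        samples.append(x)
    return samples

data = sample_skewnorm(a=5, loc=0.0, scale=1.0, num_samples=20000, seed=1)
print(sum(data) / len(data))  # close to delta * sqrt(2 / pi) for loc=0, scale=1
```

With a=0 the value of delta is 0, so the samples reduce to an ordinary normal distribution centred on loc, matching the note on the a parameter above.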
- trafpy.generator.src.dists.val_dists.gen_skewnorm_val_dist(location, skew, scale, num_skew_samples=150000, min_val=None, max_val=None, return_data=False, xlim=None, plot_fig=False, show_fig=False, logscale=False, path_to_save=None, transparent=True, rand_var_name='Random Variable', num_bins=0, round_to_nearest=None, occurrence_multiplier=10, prob_rand_var_less_than=None, num_decimal_places=2)[source]
Generates a skew norm distribution of random variable values.
- Parameters
location (int/float) – Position value of skewed distribution (mean shape parameter).
skew (int/float) – Skew value of skewed distribution (skewness shape parameter).
scale (int/float) – Scale value of skewed distribution (standard deviation shape parameter).
num_skew_samples (int) – Number of random variables to sample from distribution to generate skew data and plot.
return_data (bool) – Whether or not to return the random variable values sampled from generated distribution.
xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
transparent (bool) – Whether or not to make plot bins slightly transparent.
rand_var_name (str) – Name of random variable to label plot’s x-axis.
num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.
round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.
prob_rand_var_less_than (list) – List of values for which to print the probability that a variable sampled randomly from the generated distribution will be less than. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will return the probability that a randomly chosen variable is less than 3.7 and 5.8 respectively.
occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.
num_decimal_places (int) – Number of decimal places to round random variable values to. Need to explicitly state otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.
- Returns
- Tuple containing:
prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.
rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.
fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
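The quantity reported for prob_rand_var_less_than can be read directly off a prob_dist dictionary of the form described above. The sketch below is an illustration only: the helper name prob_less_than and the example prob_dist values are hypothetical, and it assumes a strict "less than" comparison.

```python
def prob_less_than(prob_dist, thresholds):
    """Hypothetical helper: P(X < t) for each t, given a {value: probability} dict."""
    return [
        sum(prob for val, prob in prob_dist.items() if val < t)
        for t in thresholds
    ]

# Made-up distribution for illustration only.
example_prob_dist = {1.0: 0.25, 4.0: 0.5, 6.0: 0.25}
print(prob_less_than(example_prob_dist, [3.7, 5.8]))  # [0.25, 0.75]
```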
- trafpy.generator.src.dists.val_dists.gen_uniform_val_dist(min_val, max_val, round_to_nearest=None, num_decimal_places=2, occurrence_multiplier=100, path_to_save=None, plot_fig=False, show_fig=False, return_data=False, xlim=None, logscale=False, rand_var_name='Random Variable', prob_rand_var_less_than=None, num_bins=0, print_data=False)[source]
Generates a uniform distribution of random variable values.
Uniform distributions are the simplest type of distribution. Each random variable value in a uniform distribution has an equal probability of occurring.
- Parameters
min_val (int/float) – Minimum random variable value.
max_val (int/float) – Maximum random variable value.
round_to_nearest (int/float) – Value to round random variables to nearest. E.g. if round_to_nearest=0.2, will round each random variable to nearest 0.2.
num_decimal_places (int) – Number of decimal places to round random variable values to. Need to explicitly state otherwise Python’s floating point arithmetic will cause spurious unique random variable value errors when discretising.
occurrence_multiplier (int/float) – When sampling random variables from distribution to create plot and random variable data, use this multiplier to determine number of data points to sample. A higher value will cause the random variable data to match the probability distribution more closely, but will take longer to generate.
path_to_save (str) – Path to directory (with file name included) in which to save generated distribution. E.g. path_to_save=’data/dists/my_dist’.
plot_fig (bool) – Whether or not to plot fig. If True, will return fig.
show_fig (bool) – Whether or not to plot and show fig. If True, will return and display fig.
return_data (bool) – Whether or not to return the random variable values sampled from generated distribution.
xlim (list) – X-axis limits of plot. E.g. xlim=[0,10] to plot random variable values between 0 and 10.
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
rand_var_name (str) – Name of random variable to label plot’s x-axis.
prob_rand_var_less_than (list) – List of values for which to print the probability that a variable sampled randomly from the generated distribution will be less than. This is useful for replicating distributions from the literature. E.g. prob_rand_var_less_than=[3.7,5.8] will return the probability that a randomly chosen variable is less than 3.7 and 5.8 respectively.
num_bins (int) – Number of bins to use in plot. Default is 0, in which case the number of bins chosen will be automatically selected.
print_data (bool) – whether or not to print extra information about the generated data.
- Returns
- Tuple containing:
prob_dist (dict): Probability distribution whose key-value pairs are random variable value-probability pairs.
rand_vars (list, optional): Random variable values sampled from the generated probability distribution. To return, set return_data=True.
fig (matplotlib.figure.Figure, optional): Probability density and cumulative distribution function plot. To return, set show_fig=True and/or plot_fig=True.
- Return type
tuple
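The interaction between round_to_nearest and num_decimal_places can be seen in a small standalone sketch. This is not TrafPy's code: the helper name discretise is hypothetical, and it assumes that values are first snapped to the nearest multiple of round_to_nearest and then rounded to num_decimal_places.

```python
import random
from collections import Counter

def discretise(values, round_to_nearest=None, num_decimal_places=2):
    """Hypothetical helper: map sampled values to a {value: probability} dict."""
    rounded = []
    for v in values:
        if round_to_nearest is not None:
            v = round(v / round_to_nearest) * round_to_nearest
        # The final rounding removes floating point noise such as 0.6000000000000001.
        rounded.append(round(v, num_decimal_places))
    counts = Counter(rounded)
    return {val: counts[val] / len(rounded) for val in sorted(counts)}

# Without the final rounding, 3 * 0.2 is not stored as exactly 0.6.
print(3 * 0.2 == 0.6)            # False
print(round(3 * 0.2, 2) == 0.6)  # True

rng = random.Random(0)
values = [rng.uniform(0, 1) for _ in range(10000)]
prob_dist = discretise(values, round_to_nearest=0.2)
print(prob_dist)
```

Skipping the final rounding step would leave values such as 0.6 and 0.6000000000000001 as separate keys in prob_dist, which is the kind of spurious unique value the num_decimal_places description warns about.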
- trafpy.generator.src.dists.val_dists.gen_val_dist_data(val_dist, min_val, max_val, num_vals_to_gen, path_to_save=None)[source]
Generates values between min_val and max_val following the val_dist distribution.
- trafpy.generator.src.dists.val_dists.gen_weibull_dist(_alpha, _lambda, size, round_to_nearest=None, num_decimal_places=2, min_val=None, max_val=None, interactive_params=None, logscale=False, transparent=True)[source]
Generates a Weibull distribution of random variable values.
Weibull distributions often fit scenarios whose random variable values (e.g. ‘time until failure’) are modelled by ‘extreme value theory’ (EVT) in that the values being predicted are more extreme than any previously recorded and, similar to the log-normal distribution, have a low mean but high variance and therefore a long tail/positive skew. Often used to predict time until failure.
E.g. real-world scenarios: Particle sizes generated by grinding, milling & crushing operations, survival times after cancer diagnosis, light bulb failure times, divorce rates, data centre arrival times, etc.
- Parameters
_alpha (int/float) –
Shape parameter. Describes slope of distribution.
_alpha < 1: Probability of random variable occurring decreases as values get higher. Occurs in systems with high ‘infant mortality’ in that e.g. defective items occur soon after t=0 and are therefore weeded out of the population early on.
_alpha == 1: Special case of the Weibull distribution which reduces the distribution to an exponential distribution.
_alpha > 1: Probability of random variable value occurring increases with time (until peak passes). Occurs in systems with an ‘aging’ process whereby e.g. components are more likely to fail as time goes on.
_lambda (int/float) – Weibull scale parameter. Use to shape distribution standard deviation.
size (int) – Number of random variable values to sample from distribution.
interactive_params (dict) – Dictionary of distribution parameter values (must provide if in interactive mode).
logscale (bool) – Whether or not plot should have logscale x-axis and bins.
transparent (bool) – Whether or not to make plot bins slightly transparent.
- Returns
random variable values generated by sampling from the distribution.
- Return type
list
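The effect of the shape parameter can be checked with a short standard-library sketch. This is an illustration under stated assumptions rather than TrafPy's implementation: the helper name sample_weibull is hypothetical. Note that Python's random.weibullvariate(alpha, beta) takes the scale first and the shape second, so its alpha corresponds to _lambda here and its beta corresponds to _alpha.

```python
import random

def sample_weibull(shape, scale, size, num_decimal_places=2, seed=None):
    """Hypothetical helper: draw `size` Weibull samples, rounded per sample."""
    rng = random.Random(seed)
    # random.weibullvariate expects (scale, shape), in that order.
    return [
        round(rng.weibullvariate(scale, shape), num_decimal_places)
        for _ in range(size)
    ]

# shape == 1 reduces to an exponential distribution whose mean equals the scale.
exp_like = sample_weibull(shape=1.0, scale=3.0, size=20000, seed=0)
print(sum(exp_like) / len(exp_like))

# shape < 1 gives many small values and a long tail, so the mean exceeds the median.
heavy = sample_weibull(shape=0.5, scale=3.0, size=20000, seed=0)
heavy_mean = sum(heavy) / len(heavy)
heavy_median = sorted(heavy)[len(heavy) // 2]
print(heavy_median, heavy_mean)
```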