# API documentation for vaex library¶

Vaex is a library for dealing with big tabular data.

The most important class (data structure) in vaex is the Dataset. A dataset is obtained either by opening the example dataset:

>>> import vaex as vx
>>> t = vx.example()


Or opening a file:

>>> t1 = vx.open("somedata.hdf5")
>>> t2 = vx.open("somedata.fits")
>>> t3 = vx.open("somedata.csv")


Or connecting to a remote server:

>>> tbig = vx.open("http://bla.com/bigtable")


The main purpose of vaex is to provide statistics, such as the mean, count, sum and standard deviation, per column, possibly with a selection, and on a regular grid.

To count the number of rows:

>>> t = vx.example()
>>> t.count()
330000.0


Or the number of valid values, which for this dataset is the same:

>>> t.count("x")
330000.0


Count them on a regular grid:

>>> t.count("x", binby=["x", "y"], shape=(4,4))
array([[   902.,   5893.,   5780.,   1193.],
       [  4097.,  71445.,  75916.,   4560.],
       [  4743.,  71131.,  65560.,   4108.],
       [  1115.,   6578.,   4382.,    821.]])
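Conceptually, counting on a regular grid is a multidimensional histogram. A minimal numpy sketch of the idea, with hypothetical random data standing in for the example dataset's x and y columns (this only illustrates the binning concept; the actual counts depend on the data and limits):

```python
import numpy as np

# Hypothetical data standing in for the dataset's x and y columns.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=1000)

# count(..., binby=["x", "y"], shape=(4, 4)) is conceptually a 2d histogram:
counts, _, _ = np.histogram2d(x, y, bins=(4, 4))
print(counts.shape)  # (4, 4)
print(counts.sum())  # 1000.0, every row falls in exactly one bin
```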


Visualise it using matplotlib:

>>> t.plot("x", "y", show=True)
<matplotlib.image.AxesImage at 0x1165a5090>

vaex.open(path, *args, **kwargs)[source]

Open a dataset from file given by path

Parameters:

- path (str) – local or absolute path to the file
- args – extra arguments for file readers that need them
- kwargs – extra keyword arguments

Returns: a Dataset if the file format is supported, otherwise None
>>> import vaex as vx
>>> vx.open('myfile.hdf5')
<vaex.dataset.Hdf5MemoryMapped at 0x1136ee3d0>

vaex.server(url, **kwargs)[source]

Connect to hostname supporting the vaex web api

Parameters:

- url (str) – hostname or ip address of the server

Returns a vaex.dataset.ServerRest object; note that it does not connect to the server yet, so this will always succeed.
vaex.example(download=True)[source]

Returns an example dataset which comes with vaex for testing/learning purposes

Return type: vaex.dataset.Dataset
vaex.from_arrays(**arrays)[source]

Create an in memory dataset from numpy arrays

Parameters:

- arrays – keyword arguments, where each keyword is a column name and each value is the corresponding array
>>> x = np.arange(10)
>>> y = x ** 2
>>> dataset = vx.from_arrays(x=x, y=y)

vaex.from_pandas(df, name='pandas', copy_index=True, index_name='index')[source]

Create an in memory dataset from a pandas dataframe

Parameters:

- df (pandas.DataFrame) – pandas dataframe
- name – unique name for the dataset
>>> import pandas as pd
>>> df = pd.read_csv("test.csv")
>>> ds = vx.from_pandas(df, name="test")

vaex.from_ascii(path, seperator=None, names=True, skip_lines=0, skip_after=0, **kwargs)[source]

Create an in memory dataset from an ascii file (whitespace separated by default).

>>> ds = vx.from_ascii("table.asc")
>>> ds = vx.from_ascii("table.csv", seperator=",", names=["x", "y", "z"])

Parameters:

- path – file path
- seperator – value separator, by default whitespace; use "," for comma separated values
- names – if True, the first line is used for the column names, otherwise provide a list of strings with the names
- skip_lines – skip lines at the start of the file
- skip_after – skip lines at the end of the file
- kwargs – extra keyword arguments
vaex.from_samp(username=None, password=None)[source]

Connect to a SAMP Hub and wait for a single table load event, disconnect, download the table and return the dataset

Useful if you want to send a single table from say TOPCAT to vaex in a python console or notebook

vaex.open_many(filenames)[source]

Open a list of filenames, and return a dataset with all datasets concatenated

Parameters:

- filenames (list[str]) – list of filenames/paths

Returns: Dataset
vaex.app(*args, **kwargs)[source]

Create a vaex app, the QApplication mainloop must be started.

In an ipython notebook/jupyter, do the following in the first cell:

>>> import vaex.ui.main # this causes the qt api level to be set properly
>>> import vaex as vx

In the next cell:

>>> %gui qt

In the next cell:

>>> app = vx.app()

From now on, you can run the app along with jupyter

vaex.zeldovich(dim=2, N=256, n=-2.5, t=None, scale=1, seed=None)[source]

Creates a Zeldovich dataset

vaex.set_log_level_debug()[source]

set log level to debug

vaex.set_log_level_info()[source]

set log level to info

vaex.set_log_level_warning()[source]

set log level to warning

vaex.set_log_level_exception()[source]

set log level to exception

vaex.set_log_level_off()[source]

Disable logging

## vaex.dataset module¶

class vaex.dataset.Dataset(name, column_names, executor=None)[source]

Bases: object

All datasets are encapsulated in this class, whether local or remote

Each dataset has a number of columns, and a number of rows, the length of the dataset.

The most common operation is Dataset.plot.

All Datasets have one ‘selection’, and all calculations by a Subspace are done on the whole dataset (the default) or on the selection. The following example shows how to use the selection.

>>> some_dataset.select("x < 0")
>>> subspace_xy = some_dataset("x", "y")
>>> subspace_xy_selected = subspace_xy.selected()


TODO: active fraction, length and shuffled

add_column(name, f_or_array)[source]

Add an in memory array as a column

add_column_healpix(name='healpix', longitude='ra', latitude='dec', degrees=True, healpix_order=12, nest=True)[source]

Add a healpix (in memory) column based on a longitude and latitude

Parameters:

- name – name of the column
- longitude – longitude expression
- latitude – latitude expression (astronomical convention: latitude=90 is the north pole)
- degrees – if True (default), lon/lat are in degrees, otherwise radians
- healpix_order – healpix order, >= 0
- nest – nested healpix (default) or ring
add_variable(name, expression, overwrite=True)[source]

Add a variable to the dataset

Parameters:

- name (str) – name of the variable
- expression – expression for the variable

Variable may refer to other variables, and virtual columns and expression may refer to variables

>>> dataset.add_variable("center", 0.)  # for example, a scalar value
>>> dataset.select("x_prime < 0")

add_virtual_column(name, expression)[source]

Add a virtual column to the dataset

Example:

>>> dataset.add_virtual_column("r", "sqrt(x**2 + y**2 + z**2)")
>>> dataset.select("r < 10")

Parameters:

- name (str) – name of the virtual column
- expression – expression for the column
add_virtual_column_bearing(name, lon1, lat1, lon2, lat2)[source]
add_virtual_columns_aitoff(alpha, delta, x, y, radians=True)[source]

Parameters: alpha – azimuth angle delta – polar angle x – output name for x coordinate y – output name for y coordinate radians – input and output in radians (True), or degrees (False)
add_virtual_columns_cartesian_to_polar(x='x', y='y', radius_out='r_polar', azimuth_out='phi_polar', cov_matrix_x_y=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty', radians=False)[source]

Convert cartesian to polar coordinates

Parameters:

- x – expression for x
- y – expression for y
- radius_out – name for the virtual column for the radius
- azimuth_out – name for the virtual column for the azimuth angle
- cov_matrix_x_y – list all covariance values as a double list of expressions, or "full" to guess all entries (which gives an error when values are not found), or "auto" to guess, but allow for missing values
- covariance_postfix –
- uncertainty_postfix –
- radians – if True, azimuth is in radians, defaults to degrees
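As a sketch (not vaex's implementation), the conversion these virtual columns express, with the azimuth in degrees as with the radians=False default:

```python
import numpy as np

# Cartesian (x, y) to polar (r, phi); phi in degrees, as with radians=False.
x = np.array([1.0, 0.0, -1.0])
y = np.array([0.0, 1.0, 0.0])
r = np.sqrt(x**2 + y**2)
phi = np.degrees(np.arctan2(y, x))
# all three radii are 1; the azimuths are 0, 90 and 180 degrees
```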
add_virtual_columns_cartesian_to_spherical(x='x', y='y', z='z', alpha='l', delta='b', distance='distance', radians=False, center=None, center_name='solar_position')[source]

Convert cartesian to spherical coordinates.

Parameters: x – y – z – alpha – delta – name for the polar angle, ranges from -90 to 90 (or -pi/2 to pi/2 when radians is True). distance – radians – center – center_name –
add_virtual_columns_cartesian_velocities_to_pmvr(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', pm_long='pm_long', pm_lat='pm_lat', distance=None)[source]

Convert velocities from a cartesian system to proper motions and radial velocities

TODO: errors

Parameters: x – name of x column (input) y – y z – z vx – vx vy – vy vz – vz vr – name of the column for the radial velocity in the r direction (output) pm_long – name of the column for the proper motion component in the longitude direction (output) pm_lat – name of the column for the proper motion component in the latitude direction, positive points to the north pole (output) distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance
add_virtual_columns_cartesian_velocities_to_polar(x='x', y='y', vx='vx', radius_polar=None, vy='vy', vr_out='vr_polar', vazimuth_out='vphi_polar', cov_matrix_x_y_vx_vy=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty')[source]

Convert cartesian to polar velocities.

Parameters: x – y – vx – radius_polar – Optional expression for the radius, may lead to a better performance when given. vy – vr_out – vazimuth_out – cov_matrix_x_y_vx_vy – covariance_postfix – uncertainty_postfix –
add_virtual_columns_cartesian_velocities_to_spherical(x='x', y='y', z='z', vx='vx', vy='vy', vz='vz', vr='vr', vlong='vlong', vlat='vlat', distance=None)[source]

Convert velocities from a cartesian to a spherical coordinate system

TODO: errors

Parameters: x – name of x column (input) y – y z – z vx – vx vy – vy vz – vz vr – name of the column for the radial velocity in the r direction (output) vlong – name of the column for the velocity component in the longitude direction (output) vlat – name of the column for the velocity component in the latitude direction, positive points to the north pole (output) distance – Expression for distance, if not given defaults to sqrt(x**2+y**2+z**2), but if this column already exists, passing this expression may lead to a better performance
add_virtual_columns_celestial(long_in, lat_in, long_out, lat_out, input=None, output=None, name_prefix='__celestial', radians=False)[source]
add_virtual_columns_distance_from_parallax(parallax='parallax', distance_name='distance', parallax_uncertainty=None, uncertainty_postfix='_uncertainty')[source]

Convert parallax to distance (i.e. 1/parallax)

Parameters: parallax – expression for the parallax, e.g. “parallax” distance_name – name for the virtual column of the distance, e.g. “distance” parallax_uncertainty – expression for the uncertainty on the parallax, e.g. “parallax_error” uncertainty_postfix – distance_name + uncertainty_postfix is the name for the virtual column, e.g. “distance_uncertainty” by default
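The conversion itself is just the reciprocal; assuming the uncertainty column contains the usual first-order error propagation (an assumption, not confirmed by this docstring), a sketch with hypothetical values:

```python
import numpy as np

# distance = 1 / parallax; first-order error propagation gives
# sigma_distance = sigma_parallax / parallax**2.
parallax = np.array([0.5, 2.0])               # hypothetical parallaxes
parallax_uncertainty = np.array([0.05, 0.1])  # hypothetical uncertainties
distance = 1.0 / parallax
distance_uncertainty = parallax_uncertainty / parallax**2
# distances: 2.0 and 0.5; uncertainties: 0.2 and 0.025
```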
add_virtual_columns_eq2ecl(long_in='ra', lat_in='dec', long_out='lambda_', lat_out='beta', input=None, output=None, name_prefix='__celestial_eq2ecl', radians=False)[source]

Add ecliptic coordinates (long_out, lat_out) from equatorial coordinates.

Parameters: long_in – Name/expression for right ascension lat_in – Name/expression for declination long_out – Output name for lambda coordinate lat_out – Output name for beta coordinate input – output – name_prefix – radians – input and output in radians (True), or degrees (False)
add_virtual_columns_eq2gal(long_in='ra', lat_in='dec', long_out='l', lat_out='b', input=None, output=None, name_prefix='__celestial_eq2gal', radians=False)[source]

Add galactic coordinates (long_out, lat_out) from equatorial coordinates.

Parameters: long_in – Name/expression for right ascension lat_in – Name/expression for declination long_out – Output name for galactic longitude lat_out – Output name for galactic latitude input – output – name_prefix – radians – input and output in radians (True), or degrees (False)
add_virtual_columns_equatorial_to_galactic_cartesian(alpha, delta, distance, xname, yname, zname, radians=True, alpha_gp=3.3660329196841534, delta_gp=0.47347728280415174, l_omega=0.57477043300337094)[source]
add_virtual_columns_lbrvr_proper_motion2vcartesian(long_in='l', lat_in='b', distance='distance', pm_long='pm_l', pm_lat='pm_b', vr='vr', vx='vx', vy='vy', vz='vz', cov_matrix_vr_distance_pm_long_pm_lat=None, uncertainty_postfix='_uncertainty', covariance_postfix='_covariance', name_prefix='__lbvr_proper_motion2vcartesian', center_v=(0, 0, 0), center_v_name='solar_motion', radians=False)[source]

Convert radial velocity and galactic proper motions (and positions) to cartesian velocities wrt the center_v

Parameters:

- long_in – Name/expression for galactic longitude
- lat_in – Name/expression for galactic latitude
- distance – Name/expression for heliocentric distance
- pm_long – Name/expression for the galactic proper motion in the longitude direction (pm_l*, so the cosine(b) term should be included)
- pm_lat – Name/expression for the galactic proper motion in the latitude direction
- vr – Name/expression for the radial velocity
- vx – Output name for the cartesian velocity x-component
- vy – Output name for the cartesian velocity y-component
- vz – Output name for the cartesian velocity z-component
- name_prefix –
- center_v – Extra motion that should be added, for instance lsr + motion of the sun wrt the galactic restframe
- center_v_name –
- radians – input and output in radians (True), or degrees (False)
add_virtual_columns_matrix3d(x, y, z, xnew, ynew, znew, matrix, matrix_name, matrix_is_expression=False)[source]
Parameters:

- x (str) – name of x column
- y (str) –
- z (str) –
- xnew (str) – name of transformed x column
- ynew (str) –
- znew (str) –
- matrix (list[list]) – 2d array or list, with [row, column] order
- matrix_name (str) –
add_virtual_columns_proper_motion2vperpendicular(distance='distance', pm_long='pm_l', pm_lat='pm_b', vl='vl', vb='vb', cov_matrix_distance_pm_long_pm_lat=None, uncertainty_postfix='_uncertainty', covariance_postfix='_covariance', radians=False)[source]

Convert proper motion to perpendicular velocities.

Parameters: distance – pm_long – pm_lat – vl – vb – cov_matrix_distance_pm_long_pm_lat – uncertainty_postfix – covariance_postfix – radians –
add_virtual_columns_proper_motion_eq2gal(long_in='ra', lat_in='dec', pm_long='pm_ra', pm_lat='pm_dec', pm_long_out='pm_l', pm_lat_out='pm_b', cov_matrix_alpha_delta_pma_pmd=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty', name_prefix='__proper_motion_eq2gal', radians=False)[source]

Transform/rotate proper motions from equatorial to galactic coordinates

Taken from http://arxiv.org/abs/1306.2945

Parameters: long_in – Name/expression for right ascension lat_in – Name/expression for declination pm_long – Proper motion for ra pm_lat – Proper motion for dec pm_long_out – Output name for output proper motion on l direction pm_lat_out – Output name for output proper motion on b direction name_prefix – radians – input and output in radians (True), or degrees (False)
add_virtual_columns_rotation(x, y, xnew, ynew, angle_degrees)[source]

Rotation in 2d

Parameters:

- x (str) – Name/expression of the x column
- y (str) – idem for y
- xnew (str) – name of the transformed x column
- ynew (str) – idem for y
- angle_degrees (float) – rotation in degrees, anti-clockwise
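The transformation is the standard 2d rotation matrix; a standalone numpy sketch of what the virtual columns compute:

```python
import numpy as np

# Anti-clockwise rotation by angle_degrees, as plain numpy.
angle = np.radians(90.0)
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
x_new = x * np.cos(angle) - y * np.sin(angle)
y_new = x * np.sin(angle) + y * np.cos(angle)
# (1, 0) rotates to (0, 1); (0, 1) rotates to (-1, 0).
```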
add_virtual_columns_spherical_to_cartesian(alpha, delta, distance, xname='x', yname='y', zname='z', cov_matrix_alpha_delta_distance=None, covariance_postfix='_covariance', uncertainty_postfix='_uncertainty', center=None, center_name='solar_position', radians=False)[source]

Convert spherical to cartesian coordinates.

Parameters: alpha – delta – polar angle, ranging from -90 (south pole) to 90 (north pole) distance – radial distance, determines the units of x, y and z xname – yname – zname – cov_matrix_alpha_delta_distance – covariance_postfix – uncertainty_postfix – center – center_name – radians –
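The convention described above (alpha the azimuth, delta a latitude-like polar angle) corresponds to the following sketch, hypothetical values included:

```python
import numpy as np

# Spherical (alpha, delta, distance) to cartesian (x, y, z), with delta
# measured from the equator (-90 = south pole, +90 = north pole).
alpha = np.radians(90.0)   # hypothetical azimuth
delta = np.radians(0.0)    # on the equator
distance = 2.0
x = distance * np.cos(delta) * np.cos(alpha)
y = distance * np.cos(delta) * np.sin(alpha)
z = distance * np.sin(delta)
# result is (0, 2, 0) up to floating point rounding
```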
byte_size(selection=False)[source]

Return the size in bytes the whole dataset requires (or the selection), respecting the active_fraction

classmethod can_open(path, *args, **kwargs)[source]

Tests if this class can open the file given by path

cat(i1, i2)[source]
close_files()[source]

Close any possible open file handles; the dataset will not be in a usable state afterwards

col

Convenient when working with ipython in combination with small datasets, since this gives tab-completion

Columns can be accessed by their names, which are attributes. The attributes are currently strings, so you cannot do computations with them.

>>> ds = vx.example()
>>> ds.plot(ds.col.x, ds.col.y)

column_count()[source]

Returns the number of columns, not counting virtual ones

combinations(expressions_list=None, dimension=2, exclude=None, **kwargs)[source]

Generate a list of combinations for the possible expressions for the given dimension

Parameters:

- expressions_list – list of lists of expressions, where the inner list defines the subspace
- dimension – if given, generates a subspace with all possible combinations for that dimension
- exclude – list of
copy_metadata(other)[source]
correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, async=False)[source]

Calculate the correlation coefficient cov[x,y]/(std[x]*std[y]) between x and y, possibly on a grid defined by binby

Examples:

>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)")
array(0.6366637382215669)
>>> ds.correlation("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 0.40594394,  0.69868851,  0.61394099,  0.65266318])

Parameters:

- x – expression or list of expressions, e.g. 'x', or ['x', 'y']
- y – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
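The coefficient is the one named in the docstring, cov[x,y]/(std[x]*std[y]); a plain numpy check of that definition on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = x + rng.normal(size=10_000)

# cov[x, y] / (std[x] * std[y]), using population (ddof=0) statistics.
corr = np.cov(x, y, bias=True)[0, 1] / (x.std() * y.std())
# agrees with numpy's own estimator np.corrcoef(x, y)[0, 1]
```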
count(expression=None, binby=[], limits=None, shape=128, selection=False, async=False, edges=False, progress=None)[source]

Count the number of non-NaN values (or all, if expression is None or “*”)

Examples:

>>> ds.count()
330000.0
>>> ds.count("*")
330000.0
>>> ds.count("*", binby=["x"], shape=4)
array([  10925.,  155427.,  152007.,   10748.])

Parameters:

- expression – expression or column for which to count non-missing values, or None or '*' for counting the rows
- binby – list of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False
- edges – currently for internal use only (it includes nan's and values outside the limits at the borders: nan and 0, smaller than at 1, and larger at -1)

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
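The distinction between counting rows and counting valid values comes down to NaN handling; in numpy terms:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.nan])
print(len(x))                          # 4 rows, what count() / count("*") reports
print(np.count_nonzero(~np.isnan(x)))  # 2 non-NaN values, what count("x") reports
```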
cov(x, y=None, binby=[], limits=None, shape=128, selection=False, async=False)[source]

Calculate the covariance matrix for x and y or more expressions, possibly on a grid defined by binby

Either x and y are expressions, e.g:

>>> ds.cov("x", "y")


Or only the x argument is given with a list of expressions, e.g.:

>>> ds.cov(["x", "y", "z"])

Examples:

>>> ds.cov("x", "y")
array([[ 53.54521742,  -3.8123135 ],
       [ -3.8123135 ,  60.62257881]])
>>> ds.cov(["x", "y", "z"])
array([[ 53.54521742,  -3.8123135 ,  -0.98260511],
       [ -3.8123135 ,  60.62257881,   1.21381057],
       [ -0.98260511,   1.21381057,  25.55517638]])
>>> ds.cov("x", "y", binby="E", shape=2)
array([[[  9.74852878e+00,  -3.02004780e-02],
        [ -3.02004780e-02,   9.99288215e+00]],
       [[  8.43996546e+01,  -6.51984181e+00],
        [ -6.51984181e+00,   9.68938284e+01]]])

Parameters:

- x – expression or list of expressions, e.g. 'x', or ['x', 'y']
- y – if the previous argument is not a list, this argument should be given
- binby – list of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic; the last dimensions are of shape (2,2)
covar(x, y, binby=[], limits=None, shape=128, selection=False, async=False)[source]

Calculate the covariance cov[x,y] between x and y, possibly on a grid defined by binby

Examples:

>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")
array(52.69461456005138)
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)")/(ds.std("x**2+y**2+z**2") * ds.std("-log(-E+1)"))
0.63666373822156686
>>> ds.covar("x**2+y**2+z**2", "-log(-E+1)", binby="Lz", shape=4)
array([ 10.17387143,  51.94954078,  51.24902796,  20.2163929 ])

Parameters:

- x – expression or list of expressions, e.g. 'x', or ['x', 'y']
- y – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
delete_variable(name)[source]

Deletes a variable from a dataset

delete_virtual_column(name)[source]

Deletes a virtual column from a dataset

dtype(expression)[source]
evaluate(expression, i1=None, i2=None, out=None, selection=None)[source]

Evaluate an expression, and return a numpy array with the results for the full column or a part of it.

Note that this is not how vaex should be used, since it means a copy of the data needs to fit in memory.

To get partial results, use i1 and i2.

Parameters:

- expression (str) – Name/expression to evaluate
- i1 (int) – Start row index, default is the start (0)
- i2 (int) – End row index, default is the length of the dataset
- out (ndarray) – Output array, to which the result may be written (may be used to reuse an array, or to write to a memory mapped array)
- selection – selection to apply

Returns: numpy array with the results

evaluate_selection_mask(name='default', i1=None, i2=None, selection=None)[source]
evaluate_variable(name)[source]

Evaluates the variable given by name

full_length()[source]

The full length of the dataset, independent of what active_fraction is

get_active_fraction()[source]

Value in the range (0, 1], to work only with a subset of rows

get_active_range()[source]
get_auto_fraction()[source]
get_column_names(virtual=False, hidden=False, strings=False)[source]

Return a list of column names

Parameters:

- virtual – if True, also return virtual columns
- hidden – if True, also return hidden columns
- strings – if True, also return string columns

Returns: list of str
get_current_row()[source]

Individual rows can be ‘picked’; this is the index (integer) of the current row, or None if nothing is picked

classmethod get_options(path)[source]
get_private_dir(create=False)[source]

Each dataset has a directory where files are stored for metadata etc.

>>> import vaex as vx
>>> ds = vx.example()
>>> ds.get_private_dir()
'/Users/users/breddels/.vaex/datasets/_Users_users_breddels_vaex-testing_data_helmi-dezeeuw-2000-10p.hdf5'

Parameters: create (bool) – if True, it will create the directory if it does not exist
get_selection(name='default')[source]

Get the current selection object (mostly for internal use atm)

get_variable(name)[source]

Returns the variable given by name; it will not evaluate it.

For evaluation, see Dataset.evaluate_variable(), see also Dataset.set_variable()

has_current_row()[source]

Returns True/False whether there currently is a picked row

has_selection(name='default')[source]

Returns True if there is a selection

head(n=10)[source]
head_and_tail(n=10)[source]
healpix_count(expression=None, healpix_expression=None, healpix_max_level=12, healpix_level=8, binby=None, limits=None, shape=128, async=False, progress=None, selection=None)[source]

Count non-missing values for expression on an array which represents healpix data.

Parameters: expression – Expression or column for which to count non-missing values, or None or ‘*’ for counting the rows healpix_expression – {healpix_max_level} healpix_max_level – {healpix_max_level} healpix_level – {healpix_level} binby – {binby}, these dimension follow the first healpix dimension. limits – {limits} shape – {shape} selection – {selection} async – {async} progress – {progress}
healpix_plot(healpix_expression='source_id/34359738368', healpix_max_level=12, healpix_level=8, what='count(*)', selection=None, grid=None, healpix_input='equatorial', healpix_output='galactic', f=None, colormap='afmhot', grid_limits=None, image_size=800, nest=True, figsize=None, interactive=False, title='', smooth=None, show=False, colorbar=True, rotation=(0, 0, 0))[source]
Parameters:

- healpix_expression – {healpix_max_level}
- healpix_max_level – {healpix_max_level}
- healpix_level – {healpix_level}
- what – {what}
- selection – {selection}
- grid – {grid}
- healpix_input – Specify if the healpix index is in "equatorial", "galactic" or "ecliptic".
- healpix_output – Plot in "equatorial", "galactic" or "ecliptic".
- f – function to apply to the data
- colormap – matplotlib colormap
- grid_limits – Optional sequence [minvalue, maxvalue] that determines the min and max value that map to the colormap (values below and above these are clipped to the min/max); default is [min(f(grid)), max(f(grid))]
- image_size – size for the image that healpy uses for rendering
- nest – If the healpix data is in nested (True) or ring (False) format
- figsize – If given, modify the matplotlib figure size, e.g. (14, 9)
- interactive – (experimental) uses healpy.mollzoom if True
- title – Title of the figure
- smooth – apply gaussian smoothing, in degrees
- show – Call matplotlib's show (True) or not (False, default)
- rotation – Rotate the plot, in format (lon, lat, psi) such that (lon, lat) is the center, and rotate on the screen by angle psi. All angles are in degrees.
info(description=True)[source]
is_local()[source]

Returns True if the dataset is a local dataset, False when it is a remote dataset

label(expression, unit=None, output_unit=None, format='latex_inline')[source]
limits(expression, value=None, square=False, selection=None, async=False)[source]

Calculate the [min, max] range for expression, as described by value, which is ‘99.7%’ by default.

If value is a list of the form [minvalue, maxvalue], it is simply returned; this is for convenience when using mixed forms.

Example:

>>> ds.limits("x")
array([-28.86381927,  28.9261226 ])
>>> ds.limits(["x", "y"])
(array([-28.86381927,  28.9261226 ]), array([-28.60476934,  28.96535249]))
>>> ds.limits(["x", "y"], "minmax")
(array([-128.293991,  271.365997]), array([ -71.5523682,  146.465836 ]))
>>> ds.limits(["x", "y"], ["minmax", "90%"])
(array([-128.293991,  271.365997]), array([-13.37438402,  13.4224423 ]))
>>> ds.limits(["x", "y"], ["minmax", [0, 10]])
(array([-128.293991,  271.365997]), [0, 10])

Parameters:

- expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- value – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: list in the form [[xmin, xmax], [ymin, ymax], ...., [zmin, zmax]] or [xmin, xmax] when expression is not a list
limits_percentage(expression, percentage=99.73, square=False, async=False)[source]

Calculate the [min, max] range for expression, containing approximately a percentage of the data as defined by percentage.

The range is symmetric around the median, i.e., for a percentage of 90, this gives the same results as:

>>> ds.limits_percentage("x", 90)
array([-12.35081376,  12.14858052])
>>> ds.percentile_approx("x", 5), ds.percentile_approx("x", 95)
(array([-12.36813152]), array([ 12.13275818]))


NOTE: this value is approximated by calculating the cumulative distribution on a grid. NOTE 2: The values above are not exactly the same, since percentile and limits_percentage do not share the same code

Parameters:

- expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- percentage (float) – value between 0 and 100
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: list in the form [[xmin, xmax], [ymin, ymax], ...., [zmin, zmax]] or [xmin, xmax] when expression is not a list
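A sketch of what "symmetric around the median" means here, using exact percentiles on hypothetical data (vaex itself approximates this on a grid, as the NOTE says):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)

percentage = 90.0
tail = (100.0 - percentage) / 2.0        # 5% in each tail
lo, hi = np.percentile(x, [tail, 100.0 - tail])
inside = np.mean((x >= lo) & (x <= hi))  # fraction of data inside [lo, hi]
# inside is ~0.90 by construction
```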
map_reduce(map, reduce, arguments, async=False)[source]
max(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the maximum for the given expressions, possibly on a grid defined by binby

Example:

>>> ds.max("x")
array(271.365997)
>>> ds.max(["x", "y"])
array([ 271.365997,  146.465836])
>>> ds.max("x", binby="x", shape=5, limits=[-10, 10])
array([-6.00010443, -2.00002384,  1.99998057,  5.99983597,  9.99984646])

Parameters:

- expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic; the last dimension is of shape (2)
mean(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the mean for expression, possibly on a grid defined by binby.

Examples:

>>> ds.mean("x")
-0.067131491264005971
>>> ds.mean("(x**2+y**2)**0.5", binby="E", shape=4)
array([  2.43483742,   4.41840721,   8.26742458,  15.53846476])

Parameters:

- expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
- binby – list of expressions for constructing a binned grid
- limits – description for the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
- shape – shape for the array where the statistic is calculated on; if only an integer is given, it is used for all dimensions, e.g. shape=128, shape=[128, 256]
- selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
- async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
- progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; calculations are cancelled when this callable returns False

Returns: numpy array with the given shape, or a scalar when no binby argument is given, with the statistic
median_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=16384, percentile_limits='minmax', selection=False, async=False)[source]

Calculate the median, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

Parameters:
• expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• shape – shape of the array the statistic is calculated on; if a single integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
• percentile_limits – description of the min and max values to use for the cumulative histogram; should currently only be 'minmax'
• percentile_shape – shape of the array the cumulative histogram is calculated on, integer type
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given
min(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the minimum for the given expressions, possibly on a grid defined by binby

Example:

>>> ds.min("x")
array(-128.293991)
>>> ds.min(["x", "y"])
array([-128.293991 ,  -71.5523682])
>>> ds.min("x", binby="x", shape=5, limits=[-10, 10])
array([-9.99919128, -5.99972439, -1.99991322,  2.0000093 ,  6.0004878 ])

Parameters:
• expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• shape – shape of the array the statistic is calculated on; if a single integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
• progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; the calculation is cancelled when this callable returns False

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given
minmax(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the minimum and maximum for the given expressions, possibly on a grid defined by binby

Example:

>>> ds.minmax("x")
array([-128.293991,  271.365997])
>>> ds.minmax(["x", "y"])
array([[-128.293991 ,  271.365997 ],
[ -71.5523682,  146.465836 ]])
>>> ds.minmax("x", binby="x", shape=5, limits=[-10, 10])
array([[-9.99919128, -6.00010443],
[-5.99972439, -2.00002384],
[-1.99991322,  1.99998057],
[ 2.0000093 ,  5.99983597],
[ 6.0004878 ,  9.99984646]])

Parameters:
• expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• shape – shape of the array the statistic is calculated on; if a single integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
• progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; the calculation is cancelled when this callable returns False

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given; the last dimension is of shape (2)
mode(expression, binby=[], limits=None, shape=256, mode_shape=64, mode_limits=None, progressbar=False, selection=None)[source]
mutual_information(x, y=None, mi_limits=None, mi_shape=256, binby=[], limits=None, shape=128, sort=False, selection=False, async=False)[source]

Estimate the mutual information between x and y on a grid with shape mi_shape and limits mi_limits, possibly on a grid defined by binby

If sort is True, the mutual information is returned in sorted (descending) order, together with the list of expressions in the same order

Examples:

>>> ds.mutual_information("x", "y")
array(0.1511814526380327)
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]])
array([ 0.15118145,  0.18439181,  1.07067379])
>>> ds.mutual_information([["x", "y"], ["x", "z"], ["E", "Lz"]], sort=True)
(array([ 1.07067379,  0.18439181,  0.15118145]),
[['E', 'Lz'], ['x', 'z'], ['x', 'y']])

Parameters:
• x – expression or list of expressions, e.g. 'x', or ['x', 'y']
• y – expression or list of expressions, e.g. 'x', or ['x', 'y']
• mi_limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• mi_shape – shape of the grid the mutual information is estimated on; if a single integer is given, it is used for all dimensions
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the binby expressions
• shape – shape of the binned grid; if a single integer is given, it is used for all dimensions
• sort – when True, return the mutual information in sorted (descending) order, together with the corresponding list of expressions
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given
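As a rough sketch of what this estimate involves (not vaex's actual implementation; all names below are illustrative), one can bin x and y on a joint grid, normalize it to a probability mass function, and sum p·log(p/(px·py)):

```python
import numpy as np

rng = np.random.RandomState(2)
x = rng.normal(size=100000)
y = x + rng.normal(size=100000)          # correlated, so the MI should be well above 0

# joint histogram on an mi_shape-like grid, normalized to a probability mass function
counts, _, _ = np.histogram2d(x, y, bins=64)
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1, keepdims=True)    # marginal distribution of x
p_y = p_xy.sum(axis=0, keepdims=True)    # marginal distribution of y
nonzero = p_xy > 0                       # skip empty bins (0 * log 0 := 0)
mi = np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero]))
```

For two unit-variance Gaussians with correlation ρ the exact value is -½·log(1-ρ²) nats (here ρ² = ½, so about 0.35); the histogram estimate should land close to that, with a small positive bias from finite sampling.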
classmethod option_to_args(option)[source]
percentile_approx(expression, percentage=50.0, binby=[], limits=None, shape=128, percentile_shape=1024, percentile_limits='minmax', selection=False, async=False)[source]

Calculate the percentile given by percentage, possibly on a grid defined by binby

NOTE: this value is approximated by calculating the cumulative distribution on a grid defined by percentile_shape and percentile_limits

>>> ds.percentile_approx("x", 10), ds.percentile_approx("x", 90)
(array([-8.3220355]), array([ 7.92080358]))
>>> ds.percentile_approx("x", 50, binby="x", shape=5, limits=[-10, 10])
array([[-7.56462982],
[-3.61036641],
[-0.01296306],
[ 3.56697863],
[ 7.45838367]])


Parameters:
• expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• shape – shape of the array the statistic is calculated on; if a single integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
• percentile_limits – description of the min and max values to use for the cumulative histogram; should currently only be 'minmax'
• percentile_shape – shape of the array the cumulative histogram is calculated on, integer type
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given
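The approximation described in the NOTE above can be made concrete with numpy alone. The sketch below (illustrative names, not the vaex internals) builds the cumulative histogram over percentile_shape bins and inverts it at the requested percentage:

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.normal(0, 1, 100000)

percentile_shape = 1024
lo, hi = x.min(), x.max()            # percentile_limits='minmax'
counts, edges = np.histogram(x, bins=percentile_shape, range=(lo, hi))
cdf = np.cumsum(counts) / counts.sum()

def percentile_approx(percentage):
    # first bin where the cumulative distribution reaches the requested fraction
    i = np.searchsorted(cdf, percentage / 100.0)
    return edges[i + 1]              # right edge of that bin

median = percentile_approx(50)       # close to np.percentile(x, 50)
```

The error of this scheme is bounded by the bin width, i.e. (hi - lo) / percentile_shape, which is why a larger percentile_shape gives a better approximation.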
plot(x=None, y=None, z=None, what='count(*)', vwhat=None, reduce=['colormap'], f=None, normalize='normalize', normalize_axis='what', vmin=None, vmax=None, shape=256, vshape=32, limits=None, grid=None, colormap='afmhot', figsize=None, xlabel=None, ylabel=None, aspect='auto', tight_layout=True, interpolation='nearest', show=False, colorbar=True, selection=None, selection_labels=None, title=None, background_color='white', pre_blend=False, background_alpha=1.0, visual={'fade': 'selection', 'row': 'subspace', 'layer': 'z', 'column': 'what', 'x': 'x', 'y': 'y'}, smooth_pre=None, smooth_post=None, wrap=True, wrap_columns=4, return_extra=False, hardcopy=None)[source]

Declarative plotting of statistical plots using matplotlib; supports subplots, selections and layers

Instead of passing x and y, pass a list as the x argument to get multiple panels. Give what a list of options to get multiple panels. When both are present, the panels are organized in a column/row order.

This method creates a 6-dimensional 'grid', where each dimension can map to a visual dimension. The grid dimensions are:

• x: shape determined by shape, content by the x argument or the first dimension of each space
• y: shape determined by shape, content by the y argument or the second dimension of each space
• z: related to the z argument
• selection: shape equals length of the selection argument
• what: shape equals length of the what argument
• space: shape equals length of the x argument if multiple values are given

By default, its shape is (1, 1, 1, 1, shape, shape) (where x is the last dimension)

The visual dimensions are

• x: x coordinate on a plot / image (default maps to grid's x)
• y: y coordinate on a plot / image (default maps to grid's y)
• layer: each image in this dimension is blended together into one image (default maps to z)
• fade: each image is shown faded on top of the next image (default maps to selection)
• row: rows of subplots (default maps to space)
• columns: columns of subplots (default maps to what)

All these mappings can be changed by the visual argument; some examples:

>>> ds.plot('x', 'y', what=['mean(x)', 'correlation(vx, vy)'])


Will plot each ‘what’ as a column

>>> ds.plot('x', 'y', selection=['FeH < -3', '(FeH >= -3) & (FeH < -2)'], visual=dict(column='selection'))


Will plot each selection as a column, instead of faded on top of each other.

Parameters:
• x – expression to bin in the x direction (by default maps to x), or list of pairs, like [['x', 'y'], ['x', 'z']]; if multiple pairs are given, this dimension maps to rows by default
• y – expression to bin in the y direction (by default maps to y)
• z – expression to bin in the z direction, followed by a :start,end,shape signature; e.g. 'FeH:-3,1:5' will produce 5 layers between -3 and 1 (by default maps to layer)
• what – what to plot: count(*) will show an N-d histogram, mean('x') the mean of the x column, sum('x') the sum, std('x') the standard deviation, correlation('vx', 'vy') the correlation coefficient; can also be a list of values, like ['count(x)', 'std(vx)'] (by default maps to column)
• reduce –
• f – transform values by: 'identity' does nothing, 'log' or 'log10' will show the log of the value
• normalize – normalization function, currently only 'normalize' is supported
• normalize_axis – which axes to normalize on; None means normalize by the global maximum
• vmin – instead of automatic normalization (using normalize and normalize_axis), scale the data between vmin and vmax to [0, 1]
• vmax – see vmin
• shape – shape/size of the n-D histogram grid
• limits – list of [[xmin, xmax], [ymin, ymax]], or a description such as 'minmax', '99%'
• grid – if you have done the binning yourself, you can pass it here
• colormap – matplotlib colormap to use
• figsize – (x, y) tuple passed to pylab.figure for setting the figure size
• xlabel –
• ylabel –
• aspect –
• tight_layout – call pylab.tight_layout or not
• colorbar – plot a colorbar or not
• interpolation – interpolation for imshow; possible options are 'nearest', 'bilinear', 'bicubic', see matplotlib for more
• return_extra –
plot1d(x=None, what='count(*)', grid=None, shape=64, facet=None, limits=None, figsize=None, f='identity', n=None, normalize_axis=None, xlabel=None, ylabel=None, label=None, selection=None, show=False, tight_layout=True, hardcopy=None, **kwargs)[source]
Parameters:
• x – expression to bin in the x direction
• what – what to plot: count(*) will show an N-d histogram, mean('x') the mean of the x column, sum('x') the sum
• grid – if you have done the binning yourself, you can pass it here
• facet – expression to produce facetted plots (facet='x:0,1,12' will produce 12 plots with x in a range between 0 and 1)
• limits – list of [xmin, xmax], or a description such as 'minmax', '99%'
• figsize – (x, y) tuple passed to pylab.figure for setting the figure size
• f – transform values by: 'identity' does nothing, 'log' or 'log10' will show the log of the value
• n – normalization function, currently only 'normalize' is supported, or None for no normalization
• normalize_axis – which axes to normalize on; None means normalize by the global maximum
• xlabel – string for the label on the x axis (may contain latex)
• ylabel – same for the y axis
• tight_layout – call pylab.tight_layout or not
• kwargs – extra arguments passed to pylab.plot
plot3d(x, y, z, vx=None, vy=None, vz=None, vwhat=None, limits=None, grid=None, what='count(*)', shape=128, selection=[None, True], f=None, vcount_limits=None, smooth_pre=None, smooth_post=None, grid_limits=None, normalize='normalize', colormap='afmhot', figure_key=None, fig=None, lighting=True, level=[0.1, 0.5, 0.9], opacity=[0.01, 0.05, 0.1], level_width=0.1, show=True, **kwargs)[source]

Use at own risk, requires ipyvolume

plot_bq(x, y, grid=None, shape=256, limits=None, what='count(*)', figsize=None, f='identity', figure_key=None, fig=None, axes=None, xlabel=None, ylabel=None, title=None, show=True, selection=[None, True], colormap='afmhot', grid_limits=None, normalize='normalize', grid_before=None, what_kwargs={}, type='default', scales=None, tool_select=False, bq_cleanup=True, **kwargs)[source]
remove_virtual_meta()[source]

Removes the file with the virtual columns etc.; it does not change the current virtual columns.

rename_column(name, new_name)[source]

Renames a column; note that this only changes the in-memory name, it will not be reflected on disk.

scatter(x, y, xerr=None, yerr=None, s_expr=None, c_expr=None, selection=None, length_limit=50000, length_check=True, xlabel=None, ylabel=None, errorbar_kwargs={}, **kwargs)[source]

Convenience wrapper around pylab.scatter for working with small datasets or selections

Parameters:
• x – expression for the x axis
• y – idem for y
• s_expr – when given, use it for the s (size) argument of pylab.scatter
• c_expr – when given, use it for the c (color) argument of pylab.scatter
• selection – single selection expression, or None
• length_limit – maximum number of rows it will plot
• length_check – should we do the maximum row check or not?
• xlabel – label for the x axis; if None, .label(x) is used
• ylabel – label for the y axis; if None, .label(y) is used
• errorbar_kwargs – extra dict with arguments passed to plt.errorbar
• kwargs – extra arguments passed to pylab.scatter
select(boolean_expression, mode='replace', name='default', executor=None)[source]

Perform a selection, defined by the boolean expression, and combine it with the previous selection using the given mode

Selections are recorded in a history tree, per name; undo/redo can be done for them separately

Parameters:
• boolean_expression (str) – any valid column expression, with comparison operators
• mode (str) – possible boolean operators: replace/and/or/xor/subtract
• name (str) – history tree or selection 'slot' to use
• executor –
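The mode argument maps onto ordinary boolean algebra on the per-row mask. A minimal sketch (the combine helper is made up for illustration, not vaex API):

```python
import numpy as np

def combine(previous, new, mode):
    """Combine a new boolean selection with the previous one, per mode."""
    if previous is None or mode == 'replace':
        return new
    if mode == 'and':
        return previous & new
    if mode == 'or':
        return previous | new
    if mode == 'xor':
        return previous ^ new
    if mode == 'subtract':
        return previous & ~new       # keep previously selected rows not in new
    raise ValueError("unknown mode: %s" % mode)

x = np.arange(10)
sel = combine(None, x < 6, 'replace')   # rows 0..5 selected
sel = combine(sel, x > 2, 'and')        # narrowed down to rows 3, 4, 5
```

Each call to select() with mode='and', for example, would narrow the current selection the way the second combine call does here.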
select_box(spaces, limits, mode='replace')[source]

Select a n-dimensional rectangular box bounded by limits

The following examples are equivalent:

>>> ds.select_box(['x', 'y'], [(0, 10), (0, 1)])
>>> ds.select_rectangle('x', 'y', [(0, 10), (0, 1)])

Parameters:
• spaces – list of expressions
• limits – sequence of shape [(x1, x2), (y1, y2)]
• mode –

select_inverse(name='default', executor=None)[source]

Invert the selection, i.e. what is selected will not be, and vice versa

Parameters: name (str) – executor –
select_lasso(expression_x, expression_y, xsequence, ysequence, mode='replace', name='default', executor=None)[source]

For performance reasons, a lasso selection is handled differently.

Parameters:
• expression_x (str) – name/expression for the x coordinate
• expression_y (str) – name/expression for the y coordinate
• xsequence – list of x numbers defining the lasso, together with y
• ysequence – idem for y
• mode (str) – possible boolean operators: replace/and/or/xor/subtract
• name (str) –
• executor –
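Geometrically, a lasso selection is a point-in-polygon test against the xsequence/ysequence vertices. A minimal ray-casting sketch (illustrative only, not how vaex implements it):

```python
import numpy as np

def in_lasso(px, py, xsequence, ysequence):
    """Return a boolean mask: which points (px, py) fall inside the polygon."""
    inside = np.zeros(len(px), dtype=bool)
    n = len(xsequence)
    for i in range(n):
        x1, y1 = xsequence[i], ysequence[i]
        x2, y2 = xsequence[(i + 1) % n], ysequence[(i + 1) % n]
        # does a horizontal ray from each point cross this polygon edge?
        crosses = (y1 > py) != (y2 > py)
        with np.errstate(divide='ignore', invalid='ignore'):
            xcross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (px < xcross)   # odd number of crossings = inside
    return inside

px = np.array([0.5, 2.0])
py = np.array([0.5, 2.0])
mask = in_lasso(px, py, [0, 1, 1, 0], [0, 0, 1, 1])   # lasso = unit square
```

Here the first point lies inside the unit square and the second outside, so the mask is [True, False].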
select_nothing(name='default')[source]

Select nothing

select_rectangle(x, y, limits, mode='replace')[source]

Select a 2d rectangular box in the space given by x and y, bounded by limits

Example: >>> ds.select_rectangle('x', 'y', [(0, 10), (0, 1)])

Parameters:
• x – expression for the x space
• y – expression for the y space
• limits – sequence of shape [(x1, x2), (y1, y2)]
• mode –
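In terms of the underlying row mask, a rectangular selection is just two interval tests combined. A sketch with made-up column data and the limits from the example above (whether the bounds are inclusive at the edges is an assumption here):

```python
import numpy as np

# hypothetical column values for three rows
x = np.array([0.5, 5.0, 12.0])
y = np.array([0.5, 0.9, 0.5])

(x1, x2), (y1, y2) = [(0, 10), (0, 1)]
# rows inside the box: both coordinates within their interval
mask = (x >= x1) & (x <= x2) & (y >= y1) & (y <= y2)
```

The third row falls outside the x interval, so only the first two rows are selected.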
selected_length()[source]

Returns the number of rows that are selected

selection_can_redo(name='default')[source]

Can selection name be redone?

selection_can_undo(name='default')[source]

Can selection name be undone?

selection_favorite_add(name, selection_name='default')[source]
selection_favorite_apply(name, selection_name='default', executor=None)[source]
selection_favorite_remove(name)[source]
selection_redo(name='default', executor=None)[source]

Redo selection, for the name

selection_undo(name='default', executor=None)[source]

Undo selection, for the name

selections_favorite_load()[source]
selections_favorite_store()[source]
set_active_fraction(value)[source]

Sets the active_fraction, sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_active_range(i1, i2)[source]

Sets the active range, sets the picked row to None, and removes the selection

TODO: we may be able to keep the selection, if we keep the expression, and also the picked row

set_auto_fraction(enabled)[source]
set_current_row(value)[source]

Set the current row, and emit the signal signal_pick

set_selection(selection, name='default', executor=None)[source]

Sets the selection object

Parameters:
• selection – Selection object
• name – selection 'slot'
• executor –
set_variable(name, expression_or_value, write=True)[source]

Set the variable to an expression or value defined by expression_or_value

>>> ds.set_variable("a", 2.)
>>> ds.set_variable("b", "a**2")
>>> ds.get_variable("b")
'a**2'
>>> ds.evaluate_variable("b")
4.0

Parameters:
• name – name of the variable
• expression_or_value – value or expression
• write – write the variable to the meta file
std(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the standard deviation for the given expression, possibly on a grid defined by binby

>>> ds.std("vz")
110.31773397535071
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])

Parameters:
• expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• shape – shape of the array the statistic is calculated on; if a single integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
• progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; the calculation is cancelled when this callable returns False

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given
subspace(*expressions, **kwargs)[source]

Return a Subspace for this dataset with the given expressions:

Example:

>>> subspace_xy = some_dataset.subspace("x", "y")

Parameters:
• expressions (list[str]) – list of expressions
• kwargs –

Return type: Subspace
subspaces(expressions_list=None, dimensions=None, exclude=None, **kwargs)[source]

Generate a Subspaces object, based on a custom list of expressions or all possible combinations based on dimension

Parameters:
• expressions_list – list of lists of expressions, where the inner list defines the subspace
• dimensions – if given, generates a subspace with all possible combinations for that dimension
• exclude – list of
sum(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the sum for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.sum("L")
304054882.49378014
>>> ds.sum("L", binby="E", shape=4)
array([  8.83517994e+06,   5.92217598e+07,   9.55218726e+07,
1.40008776e+08])

Parameters:
• expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• shape – shape of the array the statistic is calculated on; if a single integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
• progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; the calculation is cancelled when this callable returns False

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given
tail(n=10)[source]
to_astropy_table(column_names=None, selection=None, strings=True, virtual=False, index=None)[source]

Returns an astropy Table object containing the ndarrays corresponding to the evaluated data

Parameters:
• column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• strings – argument passed to Dataset.get_column_names when column_names is None
• virtual – argument passed to Dataset.get_column_names when column_names is None
• index – if given, this column is used for the index of the table

Returns: astropy.table.Table object
to_copy(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a copy of the Dataset; if selection is None, it does not copy the data but just keeps a reference

Parameters:
• column_names – list of column names to copy; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• strings – argument passed to Dataset.get_column_names when column_names is None
• virtual – argument passed to Dataset.get_column_names when column_names is None

Returns: Dataset
to_dict(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a dict containing the ndarray corresponding to the evaluated data

Parameters:
• column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• strings – argument passed to Dataset.get_column_names when column_names is None
• virtual – argument passed to Dataset.get_column_names when column_names is None

Returns: dict
to_items(column_names=None, selection=None, strings=True, virtual=False)[source]

Return a list of [(column_name, ndarray), ...] pairs where the ndarray corresponds to the evaluated data

Parameters:
• column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• strings – argument passed to Dataset.get_column_names when column_names is None
• virtual – argument passed to Dataset.get_column_names when column_names is None

Returns: list of (name, ndarray) pairs
to_pandas_df(column_names=None, selection=None, strings=True, virtual=False, index_name=None)[source]

Return a pandas DataFrame containing the ndarray corresponding to the evaluated data

If index_name is given, that column is used for the index of the dataframe.

>>> df = ds.to_pandas_df(["x", "y", "z"])
>>> ds_copy = vx.from_pandas(df)

Parameters:
• column_names – list of column names to export; when None, Dataset.get_column_names(strings=strings, virtual=virtual) is used
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• strings – argument passed to Dataset.get_column_names when column_names is None
• virtual – argument passed to Dataset.get_column_names when column_names is None
• index_name – if given, this column is used for the index of the DataFrame

Returns: pandas.DataFrame object
ucd_find(ucds, exclude=[])[source]

Find a set of columns (names) which have the ucd, or part of the ucd

Prefixed with a ^, it will only match the first part of the ucd

>>> dataset.ucd_find(['pos.eq.ra', 'pos.eq.dec'])
['RA', 'DEC']
>>> dataset.ucd_find(['pos.eq.ra', 'doesnotexist'])
>>> dataset.ucds[dataset.ucd_find('pos.eq.ra')]
'pos.eq.ra;meta.main'
>>> dataset.ucd_find('meta.main')
'dec'
>>> dataset.ucd_find('^meta.main')
>>>

unit(expression, default=None)[source]

Returns the unit (an astropy.unit.Units object) for the expression

>>> import vaex as vx
>>> ds = vx.example()
>>> ds.unit("x")
Unit("kpc")
>>> ds.unit("x*L")
Unit("km kpc2 / s")

Parameters:
• expression – expression, which can be a column name
• default – if no unit is known, return this instead

Returns: the resulting unit of the expression
Return type: astropy.units.Unit
update_meta()[source]

Will read back the ucd, descriptions, units etc, written by Dataset.write_meta(). This will be done when opening a dataset.

update_virtual_meta()[source]

Will read back the virtual column etc, written by Dataset.write_virtual_meta(). This will be done when opening a dataset.

validate_expression(expression)[source]

Validate an expression (may throw Exceptions)

var(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]

Calculate the sample variance for the given expression, possibly on a grid defined by binby

Examples:

>>> ds.var("vz")
12170.002429456246
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 15271.90481083,   7284.94713504,   3738.52239232,   1449.63418988])
>>> ds.var("vz", binby=["(x**2+y**2)**0.5"], shape=4)**0.5
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])
>>> ds.std("vz", binby=["(x**2+y**2)**0.5"], shape=4)
array([ 123.57954851,   85.35190177,   61.14345748,   38.0740619 ])

Parameters:
• expression – expression or list of expressions, e.g. 'x', or ['x', 'y']
• binby – list of expressions for constructing a binned grid
• limits – description of the min and max values for the expressions, e.g. 'minmax', '99.7%', [0, 10], or a list of these, e.g. [[0, 10], [0, 20], 'minmax']
• shape – shape of the array the statistic is calculated on; if a single integer is given, it is used for all dimensions, e.g. shape=128 or shape=[128, 256]
• selection – name of the selection to use (or True for the 'default'), or all the data (when selection is None or False)
• async – do not return the result, but a proxy for asynchronous calculation (currently only for internal use)
• progress – a callable that takes one argument (a floating point value between 0 and 1) indicating the progress; the calculation is cancelled when this callable returns False

Returns: Numpy array with the given shape holding the statistic, or a scalar when no binby argument is given
write_meta()[source]

Writes all meta data, ucd,description and units

The default implementation writes this to a file called meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementations may store this in the dataset file itself. (For instance the vaex hdf5 implementation does this.)

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

write_virtual_meta()[source]

Writes virtual columns, variables and their ucd,description and units

The default implementation writes this to a file called virtual_meta.yaml in the directory defined by Dataset.get_private_dir(). Other implementations may store this in the dataset file itself.

This method is called after virtual columns or variables are added. Upon opening a file, Dataset.update_virtual_meta() is called, so that the information is not lost between sessions.

Note: opening a dataset twice may result in corruption of this file.

class vaex.dataset.DatasetLocal(name, path, column_names)[source]

Base class for datasets that work with local file/data

compare(other, report_missing=True, report_difference=False, show=10, orderby=None, column_names=None)[source]

Compare two datasets and report their difference, use with care for large datasets

concat(other)[source]

Concatenates two datasets, adding the rows of the other dataset to the current one; the result is returned as a new dataset.

No copy of the data is made.

Parameters:
• other – the other dataset that is concatenated with this dataset

Returns: new dataset with the rows concatenated
Return type: DatasetConcatenated
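The "no copy" promise can be pictured with a tiny column wrapper that keeps references to both parts and translates a global row index to (part, local row). Everything below is an illustrative sketch, not the DatasetConcatenated internals:

```python
import numpy as np

class ConcatenatedColumn:
    """A column view over several ndarrays, without copying their data."""
    def __init__(self, parts):
        self.parts = parts                                     # references only
        self.offsets = np.cumsum([0] + [len(p) for p in parts])

    def __len__(self):
        return int(self.offsets[-1])

    def __getitem__(self, i):
        # find which part the global index i falls into, then index locally
        part = np.searchsorted(self.offsets, i, side='right') - 1
        return self.parts[part][i - self.offsets[part]]

a = np.arange(3)          # rows 0..2 of the concatenation
b = np.arange(10, 13)     # rows 3..5 of the concatenation
col = ConcatenatedColumn([a, b])
```

Mutating a or b in place would be visible through col, which is exactly the reference (rather than copy) semantics the docstring describes.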
data

Convenient when working with IPython in combination with small datasets, since this gives tab-completion. Only real (i.e. non-virtual) columns can be accessed; for getting the data from virtual columns, use Dataset.evaluate(...)

Columns can be accessed by their names, which are attributes. The attributes are of type numpy.ndarray

>>> ds = vx.example()
>>> r = np.sqrt(ds.data.x**2 + ds.data.y**2)

echo(arg)[source]
evaluate(expression, i1=None, i2=None, out=None, selection=None)[source]

The local implementation of Dataset.evaluate()

export_fits(path, column_names=None, shuffle=False, selection=False, progress=None, virtual=False)[source]

Exports the dataset to a fits file that is compatible with TOPCAT colfits format

Parameters:
• path (str) – path for file
• column_names (list[str]) – list of column names to export, or None for all columns
• shuffle (bool) – export rows in random order
• selection (bool) – export selection or not
• progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
• virtual (bool) – when True, export virtual columns
export_hdf5(path, column_names=None, byteorder='=', shuffle=False, selection=False, progress=None, virtual=False)[source]

Exports the dataset to a vaex hdf5 file

Parameters:
• path (str) – path for file
• column_names (list[str]) – list of column names to export, or None for all columns
• byteorder (str) – '=' for native, '<' for little endian and '>' for big endian
• shuffle (bool) – export rows in random order
• selection (bool) – export selection or not
• progress – progress callback that gets a progress fraction as argument and should return True to continue, or a default progress bar when progress=True
• virtual (bool) – when True, export virtual columns
is_local()[source]

The local implementation of Dataset.is_local(), always returns True

length(selection=False)[source]

Get the length of the dataset, for the selection or for the whole dataset.

If selection is False, it returns len(dataset)

TODO: Implement this in DatasetRemote, and move the method up in Dataset.length()

Parameters: selection – when True, return the number of selected rows
selected_length(selection='default')[source]

The local implementation of Dataset.selected_length()

shallow_copy(virtual=True, variables=True)[source]

Creates a (shallow) copy of the dataset

It will link to the same data, but will have its own state, e.g. virtual columns, variables, selection etc

class vaex.dataset.DatasetConcatenated(datasets, name=None)[source]

Represents a set of datasets all concatenated. See DatasetLocal.concat() for usage.

class vaex.dataset.DatasetArrays(name='arrays')[source]

Represent an in-memory dataset of numpy arrays, see from_arrays() for usage.

add_column(name, data)[source]

Add a column to the dataset

Parameters: name (str) – name of the column
data – numpy array with the data
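The in-memory pattern behind DatasetArrays and add_column() can be sketched without vaex, using plain numpy. The class name ColumnStore below is hypothetical, for illustration only; it only mimics the documented behavior (named columns of equal length, accessible as attributes):

```python
import numpy as np

class ColumnStore:
    """Minimal sketch of an in-memory dataset of numpy arrays."""
    def __init__(self, **arrays):
        self.columns = {}
        for name, data in arrays.items():
            self.add_column(name, data)

    def add_column(self, name, data):
        # All columns must have the same length, like rows in a table.
        data = np.asarray(data)
        if self.columns and len(data) != len(self):
            raise ValueError("column length mismatch")
        self.columns[name] = data
        setattr(self, name, data)  # attribute access gives tab completion

    def __len__(self):
        # Number of rows, i.e. the length of any column.
        return len(next(iter(self.columns.values())))

ds = ColumnStore(x=np.arange(5), y=np.arange(5) ** 2)
ds.add_column("r", np.sqrt(ds.x ** 2 + ds.y ** 2))
```

In vaex itself the equivalent entry point is vaex.from_arrays(**arrays), which returns a DatasetArrays.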

## vaex.events module¶

class vaex.events.Signal(name=None)[source]

Bases: object

connect(callback, prepend=False, *args, **kwargs)[source]
disconnect(callback)[source]
emit(*args, **kwargs)[source]
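The Signal class is documented only by its method names; the connect/disconnect/emit pattern those names imply can be sketched as follows. This is a generic observer-pattern sketch, not vaex's actual implementation:

```python
class Signal:
    """Sketch of an event signal with connect/disconnect/emit."""
    def __init__(self, name=None):
        self.name = name
        self.callbacks = []

    def connect(self, callback, prepend=False):
        # prepend lets a late subscriber run before earlier ones.
        if prepend:
            self.callbacks.insert(0, callback)
        else:
            self.callbacks.append(callback)
        return callback

    def disconnect(self, callback):
        self.callbacks.remove(callback)

    def emit(self, *args, **kwargs):
        # Call every connected callback and collect the return values.
        return [cb(*args, **kwargs) for cb in self.callbacks]

progress = Signal("progress")
progress.connect(lambda fraction: fraction < 1.0)
results = progress.emit(0.5)
```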

## vaex.execution module¶

class vaex.execution.Column[source]
needs_copy()[source]
class vaex.execution.Executor(thread_pool=None, buffer_size=None, thread_mover=None)[source]

Bases: object

execute()[source]
execute_threaded()[source]
run(task)[source]
schedule(task)[source]
class vaex.execution.Job(task, order)[source]

Bases: object

exception vaex.execution.UserAbort(reason)[source]

Bases: Exception

## vaex.grids module¶

class vaex.grids.GridScope(locals=None, globals=None)[source]

Bases: object

add_lazy(key, f)[source]
cumulative(array, normalize=True)[source]
disjoined()[source]
evaluate(expression)[source]
marginal2d(i, j)[source]
normalize(array)[source]
setter(key)[source]
slice(slice)[source]
vaex.grids.add_mem(bytes, *info)[source]
vaex.grids.dog(grid, sigma1, sigma2)[source]
vaex.grids.gf(grid, sigma, **kwargs)[source]
vaex.grids.grid_average(scope, counts_name='counts', weighted_name='weighted')[source]

## vaex.kld module¶

class vaex.kld.KlDivergenceShuffle(dataset, pairs, gridsize=128)[source]

Bases: object

get_jobs()[source]
vaex.kld.kl_divergence(P, Q, axis=None)[source]
vaex.kld.kld_shuffled(columns, Ngrid=128, datamins=None, datamaxes=None, offset=1)[source]
vaex.kld.kld_shuffled_grouped(dataset, range_map, pairs, feedback=None, size_grid=32, use_mask=True, bytes_max=536870912)[source]
vaex.kld.mutual_information(data)[source]
vaex.kld.to_disjoined(counts)[source]
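The grid handling of kl_divergence() above is not documented here, but the underlying quantity is the Kullback-Leibler divergence. A plain numpy sketch, assuming P and Q are already normalized distributions (this is an illustration of the formula, not vaex's implementation):

```python
import numpy as np

def kl_divergence(P, Q, axis=None):
    """D_KL(P || Q) = sum_i P_i * log(P_i / Q_i), skipping empty bins."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Where P is zero the term contributes nothing (lim p->0 of p*log(p) = 0).
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log(P / Q), 0.0)
    return terms.sum(axis=axis)

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])
d = kl_divergence(P, Q)
```

The divergence is zero only when the two distributions are identical, and positive otherwise.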

class vaex.multithreading.MiniJob(callable, queue_out, args)[source]

Bases: object

cancel()[source]
class vaex.multithreading.ThreadPool(nthreads=8)[source]

Bases: object

close()[source]
execute(index)[source]
run_blocks(callable, total_length)[source]
run_parallel(callable, args_list=[])[source]
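run_blocks() and run_parallel() are undocumented above; the general pattern they suggest, splitting a total length into consecutive index blocks and handing each block to a worker thread, can be sketched with the standard library. This sketch uses concurrent.futures and is not vaex's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def run_blocks(callable, total_length, nthreads=8):
    """Call callable(i1, i2) for consecutive index blocks, in parallel."""
    block = (total_length + nthreads - 1) // nthreads  # ceil division
    ranges = [(i, min(i + block, total_length))
              for i in range(0, total_length, block)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        return list(pool.map(lambda r: callable(*r), ranges))

# Sum 0..99 by summing each block of indices separately.
partial_sums = run_blocks(lambda i1, i2: sum(range(i1, i2)), 100)
total = sum(partial_sums)
```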
class vaex.multithreading.ThreadPoolIndex(nthreads=None)[source]

Bases: object

close()[source]
execute(index)[source]
map(callable, iterator, on_error=None, progress=None, cancel=None)[source]
run_blocks(callabble, total_length, parts=10, on_error=None)[source]
vaex.multithreading.get_main_pool()[source]

## vaex.remote module¶

class vaex.remote.DatasetRemote(name, server, column_names)[source]
class vaex.remote.DatasetRest(server, name, column_names, dtypes, ucds, descriptions, units, description, full_length, virtual_columns=None)[source]
correlation(x, y=None, binby=[], limits=None, shape=128, sort=False, sort_key=<ufunc 'absolute'>, selection=False, async=False, progress=None)[source]
count(expression=None, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
cov(x, y=None, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
dtype(expression)[source]
evaluate(expression, i1=None, i2=None, out=None, selection=None, async=False)[source]

Basic support for evaluation at the server, at least enough to run some unit tests; do not expect this to work from strings

is_local()[source]
mean(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
minmax(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
sum(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
var(expression, binby=[], limits=None, shape=128, selection=False, async=False, progress=None)[source]
class vaex.remote.ServerExecutor[source]

Bases: object

execute()[source]
class vaex.remote.ServerRest(hostname, port=5000, base_path='/', background=False, thread_mover=None, websocket=True)[source]

Bases: object

close()[source]
datasets(as_dict=False, async=False)[source]
submit_http(path, arguments, post_process, async, progress=None, **kwargs)[source]
submit_websocket(path, arguments, async=False, progress=None, post_process=<function ServerRest.<lambda>>)[source]
wait()[source]
class vaex.remote.SubspaceRemote(dataset, expressions, executor, async, masked=False)[source]

Bases: vaex.legacy.Subspace

correlation(means=None, vars=None)[source]
dimension
histogram(limits, size=256, weight=None)[source]
limits_sigma(sigmas=3, square=False)[source]
mean()[source]
minmax()[source]
mutual_information(limits=None, size=256)[source]
nearest(point, metric=None)[source]
sleep(seconds, async=False)[source]
sum()[source]
toarray(list)[source]
var(means=None)[source]
class vaex.remote.TaskServer(post_process, async)[source]

Bases: vaex.dataset.Task

execute()[source]
schedule(task)[source]
vaex.remote.listify(value)[source]
vaex.remote.wrap_future_with_promise(future)[source]

## vaex.samp module¶

class vaex.samp.Samp(daemon=True, name=None)[source]

Bases: object

class vaex.samp.SampSingle(name='vaex - single table load')[source]

Bases: object

wait_for_table()[source]
vaex.samp.ask_cmd_line(username, password)[source]
vaex.samp.fetch_votable(url, username=None, password=None, ask=<function ask_cmd_line>)[source]
vaex.samp.single_table(username=None, password=None)[source]

## vaex.settings module¶

class vaex.settings.AutoStoreDict(settings, store)[source]

Bases: collections.abc.MutableMapping

class vaex.settings.Files(open, recent)[source]

Bases: object

class vaex.settings.Settings(filename)[source]

Bases: object

auto_store_dict(key)[source]
dump()[source]
get(key, default=None)[source]
store(key, value)[source]

## vaex.utils module¶

class vaex.utils.AttrDict(*args, **kwargs)[source]

Bases: dict

class vaex.utils.CpuUsage(format='CPU Usage: %(cpu_usage)s%%', usage_format='% 5d')[source]

Bases: progressbar.widgets.FormatWidgetMixin, progressbar.widgets.TimeSensitiveWidgetBase

class vaex.utils.Timer(name=None, logger=None)[source]

Bases: object

vaex.utils.check_memory_usage(bytes_needed, confirm)[source]
vaex.utils.confirm_on_console(topic, msg)[source]
vaex.utils.dict_constructor(loader, node)[source]
vaex.utils.dict_representer(dumper, data)[source]
vaex.utils.disjoined(data)[source]
vaex.utils.ensure_string(string_or_bytes, encoding='utf-8')[source]
vaex.utils.filename_shorten(path, max_length=150)[source]
vaex.utils.filesize_format(value)[source]
vaex.utils.get_data_file(filename)[source]
vaex.utils.get_private_dir(subdir=None)[source]
vaex.utils.get_root_path()[source]
vaex.utils.linspace_centers(start, stop, N)[source]
vaex.utils.listify(*args)[source]
vaex.utils.make_list(sequence)[source]
vaex.utils.multisum(a, axes)[source]
vaex.utils.os_open(document)[source]

Open a document with the default handler of the OS; this could be a URL opened by a browser, a text file opened by an editor, etc.

vaex.utils.progressbar(name='processing', max_value=1)[source]
vaex.utils.progressbar_callable(name='processing', max_value=1)[source]
vaex.utils.progressbars(f=True, next=None, name=None)[source]
vaex.utils.read_json_or_yaml(filename)[source]
vaex.utils.subdivide(length, parts=None, max_length=None)[source]

Generates a list of (start, stop) index pairs of length parts: [(0, length/parts), ..., (..., length)]
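The docstring above can be illustrated with a small reimplementation. This is a sketch of the described behavior (cover range(length) with `parts` consecutive blocks), not the actual vaex source:

```python
def subdivide(length, parts):
    """Yield (start, stop) pairs that cover range(length) in `parts` pieces."""
    # Distribute the remainder over the first blocks so sizes differ by at most 1.
    step, remainder = divmod(length, parts)
    start = 0
    for i in range(parts):
        stop = start + step + (1 if i < remainder else 0)
        yield (start, stop)
        start = stop

pairs = list(subdivide(10, 3))
```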

vaex.utils.submit_subdivide(thread_count, f, length, max_length)[source]
vaex.utils.unlistify(waslist, *args)[source]
vaex.utils.write_json_or_yaml(filename, data)[source]
vaex.utils.yaml_dump(f, data)[source]
vaex.utils.yaml_load(f)[source]