pandas-td Compatibility¶
Compatible Functions¶
pytd.pandas_td.create_engine(url, con=None, header=True, show_progress=5.0, clear_progress=True)
Create a handler for a query engine based on a URL.
The following environment variables are used for the default connection:
- TD_API_KEY: API key
- TD_API_SERVER: API server (default: https://api.treasuredata.com)
- Parameters
- url : string
Engine descriptor in the form "type://apikey@host/database?params…" Use the shorthand notation "type:database?params…" for the default connection. pytd: "params" will be ignored since pytd.QueryEngine does not have any extra parameters.
- con : pytd.Client, optional
Handler returned by pytd.pandas_td.connect. If not given, the default client is used.
- header : string or boolean, default: True
Prepend comment strings, in the form "-- comment", as a header of queries. Set False to disable the header.
- show_progress : double or boolean, default: 5.0
Number of seconds to wait before printing progress. Set False to disable progress output entirely. pytd: This argument will be ignored.
- clear_progress : boolean, default: True
If True, clear the progress output when the query completes. pytd: This argument will be ignored.
- Returns
- QueryEngine
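The engine descriptor above follows a URL-like shape. As a rough illustration of how the non-shorthand form decomposes (this is a stdlib sketch, not pytd's actual parsing code), the pieces can be pulled apart with urllib.parse:

```python
from urllib.parse import urlparse

def parse_engine_url(url):
    """Split an engine descriptor of the form 'type://apikey@host/database'
    into its components. Illustrative sketch only, not pytd's parser."""
    parsed = urlparse(url)
    return {
        "type": parsed.scheme,              # query engine type, e.g. 'presto' or 'hive'
        "apikey": parsed.username,          # the API key before the '@'
        "host": parsed.hostname,            # API server host
        "database": parsed.path.lstrip("/"),  # target database name
    }

parts = parse_engine_url("presto://APIKEY@api.treasuredata.com/sample_datasets")
```

Here "APIKEY" is a placeholder, not a real credential; with the shorthand "type:database" form, the host and key come from the TD_API_SERVER and TD_API_KEY environment variables instead.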
pytd.pandas_td.read_td_query(query, engine, index_col=None, parse_dates=None, distributed_join=False, params=None)
Read a Treasure Data query into a DataFrame.
Returns a DataFrame corresponding to the result set of the query string. Optionally provide an index_col parameter to use one of the columns as the index; otherwise a default integer index will be used.
- Parameters
- query : string
Query string to be executed.
- engine : QueryEngine
Handler returned by create_engine.
- index_col : string, optional
Column name to use as index for the returned DataFrame object.
- parse_dates : list or dict, optional
List of column names to parse as dates, or a dict of {column_name: format string} where the format string is strftime-compatible when parsing string times, or one of (D, s, ns, ms, us) when parsing integer timestamps.
- distributed_join : boolean, default: False
(Presto only) If True, distributed join is enabled. If False, broadcast join is used. See https://prestodb.io/docs/current/release/release-0.77.html
- params : dict, optional
Parameters to pass to the execute method. pytd does not support the parameter "type" ('hive', 'presto'); the query type needs to be defined by engine. Available parameters:
- db (str): use the database
- result_url (str): result output URL
- priority (int or str): priority
  - -2: "VERY LOW"
  - -1: "LOW"
  - 0: "NORMAL"
  - 1: "HIGH"
  - 2: "VERY HIGH"
- retry_limit (int): max number of automatic retries
- wait_interval (int): sleep interval until the job finishes
- wait_callback (function): called at every interval against the job itself
- Returns
- DataFrame
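The priority parameter above accepts either the integer level or its string name. A minimal sketch of that equivalence (for illustration only; pytd and the underlying client resolve this internally):

```python
# Integer-to-string mapping for the 'priority' job parameter, as listed above.
PRIORITY_NAMES = {
    -2: "VERY LOW",
    -1: "LOW",
    0: "NORMAL",
    1: "HIGH",
    2: "VERY HIGH",
}

def normalize_priority(priority):
    """Accept either the integer or string form and return the integer level.
    Hypothetical helper for illustration, not part of pytd's API."""
    if isinstance(priority, str):
        inverse = {name: level for level, name in PRIORITY_NAMES.items()}
        if priority not in inverse:
            raise ValueError(f"unknown priority name: {priority!r}")
        return inverse[priority]
    if priority not in PRIORITY_NAMES:
        raise ValueError(f"unknown priority level: {priority!r}")
    return priority
```

A params dict such as {"priority": "HIGH", "retry_limit": 3} would therefore be equivalent to {"priority": 1, "retry_limit": 3}.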
pytd.pandas_td.read_td_job(job_id, engine, index_col=None, parse_dates=None)
Read a Treasure Data job result into a DataFrame.
Returns a DataFrame corresponding to the result set of the job. This method waits for job completion if the specified job is still running. Optionally provide an index_col parameter to use one of the columns as the index; otherwise a default integer index will be used.
- Parameters
- job_id : integer
Job ID.
- engine : QueryEngine
Handler returned by create_engine.
- index_col : string, optional
Column name to use as index for the returned DataFrame object.
- parse_dates : list or dict, optional
List of column names to parse as dates, or a dict of {column_name: format string} where the format string is strftime-compatible when parsing string times, or one of (D, s, ns, ms, us) when parsing integer timestamps.
- Returns
- DataFrame
pytd.pandas_td.read_td_table(table_name, engine, index_col=None, parse_dates=None, columns=None, time_range=None, limit=10000)
Read a Treasure Data table into a DataFrame.
The number of returned rows is limited by "limit" (default: 10,000). Setting limit=None returns all rows. Be careful with limit=None: the table might be very large and the result may not fit into memory.
- Parameters
- table_name : string
Name of the Treasure Data table in the database.
- engine : QueryEngine
Handler returned by create_engine.
- index_col : string, optional
Column name to use as index for the returned DataFrame object.
- parse_dates : list or dict, optional
List of column names to parse as dates, or a dict of {column_name: format string} where the format string is strftime-compatible when parsing string times, or one of (D, s, ns, ms, us) when parsing integer timestamps.
- columns : list, optional
List of column names to select from the table.
- time_range : tuple (start, end), optional
Time range to select. "start" and "end" are each one of None, an integer, a string, or a datetime object. "end" is exclusive and not included in the result.
- limit : int, default: 10,000
Maximum number of rows to select.
- Returns
- DataFrame
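The end-exclusive semantics of time_range can be pictured as a half-open unixtime interval. A hedged sketch (this is not pytd's implementation; string inputs are omitted for brevity, and the conversion shown is only one plausible reading of the integer/datetime cases):

```python
from datetime import datetime, timezone

def time_range_to_unixtime(start, end):
    """Convert a (start, end) pair of None/int/datetime values into unixtime
    bounds, with 'end' exclusive. Illustrative sketch, not pytd's internals."""
    def to_unix(value):
        if value is None:
            return None  # an open bound
        if isinstance(value, datetime):
            return int(value.timestamp())
        return int(value)  # already a unixtime
    return to_unix(start), to_unix(end)

# A half-open range covering all of 2019 (end is exclusive, so rows at
# exactly 2020-01-01 00:00:00 UTC are NOT included):
start, end = time_range_to_unixtime(
    datetime(2019, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 1, 1, tzinfo=timezone.utc),
)
```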
pytd.pandas_td.read_td(query, engine, index_col=None, parse_dates=None, distributed_join=False, params=None)
Read a Treasure Data query into a DataFrame.
Returns a DataFrame corresponding to the result set of the query string. Optionally provide an index_col parameter to use one of the columns as the index; otherwise a default integer index will be used.
- Parameters
- query : string
Query string to be executed.
- engine : QueryEngine
Handler returned by create_engine.
- index_col : string, optional
Column name to use as index for the returned DataFrame object.
- parse_dates : list or dict, optional
List of column names to parse as dates, or a dict of {column_name: format string} where the format string is strftime-compatible when parsing string times, or one of (D, s, ns, ms, us) when parsing integer timestamps.
- distributed_join : boolean, default: False
(Presto only) If True, distributed join is enabled. If False, broadcast join is used. See https://prestodb.io/docs/current/release/release-0.77.html
- params : dict, optional
Parameters to pass to the execute method. pytd does not support the parameter "type" ('hive', 'presto'); the query type needs to be defined by engine. Available parameters:
- db (str): use the database
- result_url (str): result output URL
- priority (int or str): priority
  - -2: "VERY LOW"
  - -1: "LOW"
  - 0: "NORMAL"
  - 1: "HIGH"
  - 2: "VERY HIGH"
- retry_limit (int): max number of automatic retries
- wait_interval (int): sleep interval until the job finishes
- wait_callback (function): called at every interval against the job itself
- Returns
- DataFrame
pytd.pandas_td.to_td(frame, name, con, if_exists='fail', time_col=None, time_index=None, index=True, index_label=None, chunksize=10000, date_format=None, writer='bulk_import')
Write a DataFrame to a Treasure Data table.
This method converts the DataFrame into a series of key-value pairs and sends them using the Treasure Data streaming API. The data is divided into chunks of rows (default: 10,000) and uploaded separately. If an upload fails, the client retries the process for a certain amount of time (max_cumul_retry_delay; default: 600 secs). This method may fail and raise an exception when the retries do not succeed, in which case the data may be partially inserted. Use the bulk import utility if you cannot accept partial inserts.
- Parameters
- frame : DataFrame
DataFrame to be written.
- name : string
Name of the table to be written, in the form 'database.table'.
- con : pytd.Client
A client for a Treasure Data account, returned by pytd.pandas_td.connect.
- if_exists : {'error' ('fail'), 'overwrite' ('replace'), 'append', 'ignore'}, default: 'error'
What happens when the target table already exists. For pandas-td compatibility, 'error', 'overwrite', 'append', and 'ignore' can respectively be given as:
- fail: If the table exists, raise an exception.
- replace: If the table exists, drop it, recreate it, and insert the data.
- append: If the table exists, insert the data. Create the table if it does not exist.
- ignore: If the table exists, do nothing.
- time_col : string, optional
Column name to use as the "time" column for the table. The column type must be integer (unixtime), datetime, or string. If None is given (default), the current time is used as the time values.
- time_index : int, optional
Level of the index to use as the "time" column for the table. Set 0 for a single index. This parameter implies index=False.
- index : boolean, default: True
Write the DataFrame index as a column.
- index_label : string or sequence, default: None
Column label for the index column(s). If None is given (default) and index is True, the index names are used. A sequence should be given if the DataFrame uses a MultiIndex.
- chunksize : int, default: 10,000
Number of rows to be inserted in each chunk from the DataFrame. pytd: This argument will be ignored.
- date_format : string, default: None
Format string for datetime objects.
- writer : string {'bulk_import', 'insert_into', 'spark'} or pytd.writer.Writer, default: 'bulk_import'
A Writer that chooses the method of writing to Treasure Data. If not given, or given as a string, a temporary Writer instance will be created.
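The pandas-td/pytd naming equivalence for if_exists described above can be summarized as a small translation table. The helper below is an illustration only, not pytd's internal code:

```python
# pandas-td used pandas-style names ('fail', 'replace', ...) while pytd
# uses its own ('error', 'overwrite', ...); both are accepted.
IF_EXISTS_ALIASES = {
    "fail": "error",
    "replace": "overwrite",
    "append": "append",
    "ignore": "ignore",
}

def normalize_if_exists(value):
    """Translate a pandas-td style if_exists value to the pytd equivalent.
    Values already in pytd form pass through unchanged. Hypothetical helper."""
    if value in IF_EXISTS_ALIASES.values():
        return value
    try:
        return IF_EXISTS_ALIASES[value]
    except KeyError:
        raise ValueError(f"invalid if_exists: {value!r}") from None
```

So a pandas-style call like to_td(df, 'mydb.mytable', con, if_exists='replace') behaves the same as passing if_exists='overwrite'.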
IPython Magics¶
IPython magics to access Treasure Data. First, load the magics:
In [1]: %load_ext pytd.pandas_td.ipython
class pytd.pandas_td.ipython.TDMagics(shell)

__init__(self, shell)
Create a configurable given a config.
- Parameters
- config : Config
If this is empty, default values are used. If config is a Config instance, it will be used to configure the instance.
- parent : Configurable instance, optional
The parent Configurable instance of this object.
Notes
Subclasses of Configurable must call the __init__() method of Configurable before doing anything else, using super():

    class MyConfigurable(Configurable):
        def __init__(self, config=None):
            super(MyConfigurable, self).__init__(config=config)
            # Then any other code you need to finish initialization.

This ensures that instances will be configured properly.
class pytd.pandas_td.ipython.DatabasesMagics(shell)

td_databases(self, pattern)
List databases in the form of a pandas.DataFrame.

%td_databases [<database_name_pattern>]

- Parameters
- <database_name_pattern> : string, optional
List databases matching the given pattern. If not given, all existing databases will be listed.
- Returns
- pandas.DataFrame
Examples

In [1]: %load_ext pytd.pandas_td.ipython

In [2]: %td_databases
Out[2]:
                                      name   count     permission                created_at                updated_at
0      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  348124  administrator 2019-01-23 05:48:11+00:00 2019-01-23 05:48:11+00:00
1                                yyyyyyyyy       0  administrator 2017-12-14 07:52:34+00:00 2017-12-14 07:52:34+00:00
2                            zzzzzzzzzzzzz       0  administrator 2016-05-25 23:12:06+00:00 2016-05-25 23:12:06+00:00
...

In [3]: %td_databases sample
Out[3]:
              name    count     permission                created_at                updated_at
0         sampledb        2  administrator 2014-04-11 22:29:38+00:00 2014-04-11 22:29:38+00:00
1  sample_xxxxxxxx        2  administrator 2017-06-02 23:37:41+00:00 2017-06-02 23:37:41+00:00
2  sample_datasets  8812278     query_only 2014-10-04 01:13:11+00:00 2018-03-16 04:59:06+00:00
...

magics = {'cell': {}, 'line': {'td_databases': 'td_databases'}}
registered = True
class pytd.pandas_td.ipython.TablesMagics(shell)

td_tables(self, pattern)
List tables in databases.

%td_tables [<table_identifier_pattern>]

- Parameters
- <table_identifier_pattern> : string, optional
List tables matching the given pattern. A table identifier is represented as database_name.table_name. If not given, all existing tables will be listed.
- Returns
- pandas.DataFrame
Examples

In [1]: %load_ext pytd.pandas_td.ipython

In [2]: %td_tables
Out[2]:
         db_name                name  count  estimated_storage_size        last_log_timestamp                created_at
0  xxxxx_demo_aa       customer_test     70                    1047 2018-02-05 06:20:32+00:00 2018-02-05 06:20:24+00:00
1  xxxxx_demo_aa           email_log      0                       0 1970-01-01 00:00:00+00:00 2018-02-05 07:19:57+00:00
2          yy_wf  topk_similar_items  10598                  134208 2018-04-16 09:23:57+00:00 2018-04-16 09:59:48+00:00
...

In [3]: %td_tables sample
Out[3]:
        db_name               name  count  estimated_storage_size        last_log_timestamp                created_at
0       xx_test    aaaaaaaa_sample      0                       0 1970-01-01 00:00:00+00:00 2015-10-20 17:37:40+00:00
1      sampledb          sampletbl      2                     843 1970-01-01 00:00:00+00:00 2014-04-11 22:30:08+00:00
2  zzzz_test_db  sample_output_tab      4                     889 2018-06-06 08:26:20+00:00 2018-06-06 08:27:12+00:00
...

magics = {'cell': {}, 'line': {'td_tables': 'td_tables'}}
registered = True
class pytd.pandas_td.ipython.JobsMagics(shell)

td_jobs(self, line)
List job activities in an account.

%td_jobs

- Returns
- pandas.DataFrame
Examples

In [1]: %load_ext pytd.pandas_td.ipython

In [2]: %td_jobs
Out[2]:
    status     job_id    type                  start_at                                              query
0    error  448650806    hive 2019-04-12 05:33:36+00:00  with null_samples as (\n  select\n    id,\n  ...
1  success  448646994  presto 2019-04-12 05:23:29+00:00  -- read_td_query\n-- set session distributed_j...
2  success  448646986  presto 2019-04-12 05:23:27+00:00  -- read_td_query\n-- set session distributed_j...
...

magics = {'cell': {}, 'line': {'td_jobs': 'td_jobs'}}
registered = True
class pytd.pandas_td.ipython.UseMagics(shell)

td_use(self, line)
Use a specific database.
This magic pushes all table names in the specified database into the current namespace.

%td_use [<database_name>]

- Parameters
- <database_name> : string
Database name.
Examples

In [1]: %load_ext pytd.pandas_td.ipython

In [2]: %td_use sample_datasets
INFO: import nasdaq
INFO: import www_access

In [3]: nasdaq  # describe table columns in the form of a DataFrame
Out[3]: <pytd.pandas_td.ipython.MagicTable at 0x117651908>

magics = {'cell': {}, 'line': {'td_use': 'td_use'}}
registered = True
class pytd.pandas_td.ipython.QueryMagics(shell)

td_job(self, line)
Get a job result.

%td_job [--pivot] [--plot] [--dry-run] [--verbose] [--connection <connection>] [--dropna] [--out <out>] [--out-file <out_file>] [--quiet] [--timezone <timezone>] job_id

- Parameters
- <job_id> : integer
Job ID.
- --pivot : optional
Run pivot_table against dimensions.
- --plot : optional
Plot the query result.
- --dry-run, -n : optional
Output the translated code without running the query.
- --verbose, -v : optional
Verbose output.
- --connection <connection>, -c <connection> : pytd.Client, optional
Use the specified connection.
- --dropna, -d : optional
Drop columns whose values are all NA.
- --out <out>, -o <out> : string, optional
Store the result in a variable.
- --out-file <out_file>, -O <out_file> : string, optional
Store the result in a file.
- --quiet, -q : optional
Disable progress output.
- --timezone <timezone>, -T <timezone> : string, optional
Set the timezone on the time index.
- Returns
- pandas.DataFrame
Examples

In [1]: %load_ext pytd.pandas_td.ipython

In [2]: %td_job 451709460  # select * from sample_datasets.nasdaq limit 5
Out[2]:
                    symbol  open  volume     high      low    close
time
1992-08-25 16:00:00   ATRO   0.0    3900   0.7076   0.7076   0.7076
1992-08-25 16:00:00   ALOG   0.0   11200  11.0000  10.6250  11.0000
1992-08-25 16:00:00   ATAX   0.0   11400  11.3750  11.0000  11.0000
1992-08-25 16:00:00   ATRI   0.0    5400  14.3405  14.0070  14.2571
1992-08-25 16:00:00   ABMD   0.0   38800   5.7500   5.2500   5.6875
td_hive(self, line, cell)
Run a Hive query.

%%td_hive [<database>] [--pivot] [--plot] [--dry-run] [--verbose] [--connection <connection>] [--dropna] [--out <out>] [--out-file <out_file>] [--quiet] [--timezone <timezone>] <query>

- Parameters
- <query> : string
Hive query.
- <database> : string, optional
Database name.
- --pivot : optional
Run pivot_table against dimensions.
- --plot : optional
Plot the query result.
- --dry-run, -n : optional
Output the translated code without running the query.
- --verbose, -v : optional
Verbose output.
- --connection <connection>, -c <connection> : pytd.Client, optional
Use the specified connection.
- --dropna, -d : optional
Drop columns whose values are all NA.
- --out <out>, -o <out> : string, optional
Store the result in a variable.
- --out-file <out_file>, -O <out_file> : string, optional
Store the result in a file.
- --quiet, -q : optional
Disable progress output.
- --timezone <timezone>, -T <timezone> : string, optional
Set the timezone on the time index.
- Returns
- pandas.DataFrame
Examples

In [1]: %load_ext pytd.pandas_td.ipython

In [2]: %%td_hive
   ...: select hivemall_version()
   ...:
Out[2]:
                         _c0
0  0.6.0-SNAPSHOT-201901-r01
td_presto(self, line, cell)
Run a Presto query.

%%td_presto [<database>] [--pivot] [--plot] [--dry-run] [--verbose] [--connection <connection>] [--dropna] [--out <out>] [--out-file <out_file>] [--quiet] [--timezone <timezone>] <query>

- Parameters
- <query> : string
Presto query.
- <database> : string, optional
Database name.
- --pivot : optional
Run pivot_table against dimensions.
- --plot : optional
Plot the query result.
- --dry-run, -n : optional
Output the translated code without running the query.
- --verbose, -v : optional
Verbose output.
- --connection <connection>, -c <connection> : pytd.Client, optional
Use the specified connection.
- --dropna, -d : optional
Drop columns whose values are all NA.
- --out <out>, -o <out> : string, optional
Store the result in a variable.
- --out-file <out_file>, -O <out_file> : string, optional
Store the result in a file.
- --quiet, -q : optional
Disable progress output.
- --timezone <timezone>, -T <timezone> : string, optional
Set the timezone on the time index.
- Returns
- pandas.DataFrame
Examples

In [1]: %load_ext pytd.pandas_td.ipython

In [2]: %%td_presto
   ...: select * from sample_datasets.nasdaq limit 5
   ...:
Out[2]:
                    symbol  open  volume     high      low    close
time
1989-01-26 16:00:00   SMTC   0.0    8000   0.4532   0.4532   0.4532
1989-01-26 16:00:00   SEIC   0.0  163200   0.7077   0.6921   0.7025
1989-01-26 16:00:00   SIGI   0.0    2800   3.9610   3.8750   3.9610
1989-01-26 16:00:00   NAVG   0.0    1800  14.6740  14.1738  14.6740
1989-01-26 16:00:00   MOCO   0.0   71101   3.6722   3.5609   3.5980

magics = {'cell': {'td_hive': 'td_hive', 'td_presto': 'td_presto'}, 'line': {'td_job': 'td_job'}}
registered = True