BlazingSQL Documentation

Welcome to our Documentation and Support Page!

BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS data science framework. RAPIDS is a collection of open-source libraries for end-to-end data science pipelines entirely in the GPU. BlazingSQL extends RAPIDS and enables users to run SQL queries on Apache Arrow in GPU memory.

Please install, test, deploy, and gripe on our Discussion board.

Data Definition Language (DDL)

BlazingSQL is not a database; it is a query execution engine. DDL here refers to the data types we support and to how users create objects in BlazingSQL for querying.

This section covers the following topics:

  • Data Types
  • Object Creation

Data Types

These are the currently supported data types for BlazingSQL:

  • Integer
    • INT8
    • INT16
    • INT32
    • INT64
  • Floating Point
    • FLOAT8
    • FLOAT16
    • FLOAT32
    • FLOAT64
  • Date Time
    • DATE32 (int32_t days since the UNIX epoch)
    • DATE64 (int64_t milliseconds since the UNIX epoch)
    • TIMESTAMP (int64_t encoded as an exact timestamp since the UNIX epoch)
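
In practice these types correspond to the NumPy-style dtype strings cuDF uses when you build a GPU DataFrame. A minimal sketch (the column names and values here are made up purely for illustration):

import cudf
import numpy as np

# Build a small GPU DataFrame whose columns use supported types
df = cudf.DataFrame()
df['id'] = np.array([1, 2, 3], dtype='int32')             # INT32
df['score'] = np.array([0.5, 1.5, 2.5], dtype='float64')  # FLOAT64

# Confirm the resulting column types
print(df.dtypes)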

Null and Strings Support

String and null support are not yet available. We are working on both as quickly as possible and expect to ship an update before the end of January 2019.

Object Creation

BlazingSQL V0.2 can query data held in memory or stored in files. Our currently supported formats are:

  • In-Memory
    • GPU DataFrame (GDF)
    • Pandas
    • Apache Arrow
  • Files
    • CSV
    • Apache Parquet

File Format Support

Please let us know which file formats you would like us to support in our Discussion forum: https://docs.blazingdb.com/discuss

GPU DataFrame (GDF)

import cudf
import pyblazing

# Define Column Names and Column Data Types
column_names = ['n_nationkey', 'n_name', 'n_regionkey', 'n_comments']
column_types = ['int32', 'int64', 'int32', 'int64']

# Create GPU DataFrame from CSV File (similar to Pandas)
nation_gdf = cudf.read_csv("../data/nation.csv", delimiter='|',
                           dtype=column_types, names=column_names)

# Create Tables Dictionary
tables = {'nation': nation_gdf}

# SQL Query
sql = 'select n_nationkey, n_regionkey, n_nationkey + n_regionkey as addition from main.nation'

# Execute Query
result_gdf = pyblazing.run_query(sql, tables)

print(sql)
print(result_gdf)

GDF Queries

You must pass your tables dictionary into the run_query() call, e.g. pyblazing.run_query(sql, tables).
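
The same dictionary can register several GDFs at once, so a single query can join across them. A hedged sketch building on the example above (it assumes a ../data/region.csv file with the columns shown; adjust paths, names, and dtypes to your data):

import cudf
import pyblazing

# Load two tables as GPU DataFrames (the region schema is illustrative)
nation_gdf = cudf.read_csv("../data/nation.csv", delimiter='|',
                           dtype=['int32', 'int64', 'int32', 'int64'],
                           names=['n_nationkey', 'n_name', 'n_regionkey', 'n_comments'])
region_gdf = cudf.read_csv("../data/region.csv", delimiter='|',
                           dtype=['int32', 'int64', 'int64'],
                           names=['r_regionkey', 'r_name', 'r_comments'])

# Register both under the names the SQL will reference
tables = {'nation': nation_gdf, 'region': region_gdf}

# Join across the two registered tables
sql = 'select n.n_nationkey, r.r_regionkey from main.nation as n inner join main.region as r on n.n_regionkey = r.r_regionkey'

result_gdf = pyblazing.run_query(sql, tables)
print(result_gdf)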

Pandas DataFrame

import pandas as pd
import pyblazing

# Define Column Names and Column Data Types
column_names = ['n_nationkey', 'n_name', 'n_regionkey', 'n_comments']
column_types = {'n_nationkey': 'int32', 'n_regionkey': 'int64'}

# Create Pandas DataFrame from CSV File
nation_df = pd.read_csv("../data/nation.csv", delimiter='|',
                        dtype=column_types, names=column_names)

# Create Tables Dictionary
tables = {'nation': nation_df}

# SQL Query
sql = 'select n_nationkey, n_regionkey, n_nationkey + n_regionkey as addition from main.nation'

# Execute Query
result_gdf = pyblazing.run_query_pandas(sql, tables)

print(sql)
print(result_gdf)

Apache Arrow Data

import pyarrow as pa
import pyblazing

# Create Apache Arrow Object
arrow_table = pa.RecordBatchStreamReader('../data/gpu.arrow').read_all()

# Convert to Pandas DataFrame
df = arrow_table.to_pandas()

# Create Table Dictionary
tables = {'gpu_info': df}

# SQL Query
sql = 'select swings+1, tractions+10 from main.gpu_info'

# Execute Query (run_query_pandas, since the table was converted to a Pandas DataFrame)
result_gdf = pyblazing.run_query_pandas(sql, tables)

print(sql)
print(result_gdf)

CSV File

In order to query a CSV file with BlazingSQL you need to define the schema. For now (changes coming soon), column data types are given as integer codes that map to the NumPy-backed types as follows:

Data Type    Number
INT8         1
INT16        2
INT32        3
INT64        4
FLOAT32      5
FLOAT64      6
DATE32       7
DATE64       8
TIMESTAMP    9
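
If you would rather not hard-code the integer codes, you can keep the mapping from the table above in a small dict (purely a convenience sketch; the codes themselves come from the table):

# Map type names to the integer codes BlazingSQL expects for CSV schemas
CSV_TYPE_CODES = {
    'int8': 1, 'int16': 2, 'int32': 3, 'int64': 4,
    'float32': 5, 'float64': 6,
    'date32': 7, 'date64': 8, 'timestamp': 9,
}

# e.g. the nation table below uses int32, int64, int32, int64
dtypes = [CSV_TYPE_CODES[t] for t in ['int32', 'int64', 'int32', 'int64']]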

import pyblazing
from pyblazing import SchemaFrom

# Register the HDFS filesystem with BlazingSQL
# (register_hdfs() stands in here for your own filesystem-registration code)
register_hdfs()

# Define Column Names and Column Data Types
names = ['n_nationkey', 'n_name', 'n_regionkey', 'n_comment']
dtypes = [3, 4, 3, 4]

# Create Table Reference
nation_schema = pyblazing.register_table_schema(
    table_name='nation', type=SchemaFrom.CsvFile,
    path='hdfs://tpch_hdfs/Data1Mb/nation_0_0.csv',
    delimiter='|', dtypes=dtypes, names=names)
table_data = {
  nation_schema: ['hdfs://tpch_hdfs/Data1Mb/nation_0_0.csv']
}

# SQL Query
sql = 'select n_nationkey, n_regionkey + n_nationkey as addition from main.nation'

# Execute Query
result_gdf = pyblazing.run_query_filesystem(sql, table_data)

print(sql)
print(result_gdf)

Apache Parquet

import pyblazing
from pyblazing import SchemaFrom

# Register Apache Parquet Tables
customer_schema = pyblazing.register_table_schema(
    table_name='customer_parquet', type=SchemaFrom.ParquetFile,
    path='/tmp/DataSet50mb/customer_0_0.parquet')

nation_schema = pyblazing.register_table_schema(
    table_name='nation_parquet', type=SchemaFrom.ParquetFile,
    path='/tmp/DataSet50mb/nation_0_0.parquet')

# Create Table Reference
sql_data = {
  customer_schema: ['/tmp/DataSet50mb/customer_0_0.parquet',
                    '/tmp/DataSet50mb/customer_0_1.parquet'],
  nation_schema: ['/tmp/DataSet50mb/nation_0_0.parquet']
}

# SQL Query
sql = '''
    select avg(c.c_custkey), avg(c.c_nationkey), n.n_regionkey
    from main.customer_parquet as c
    inner join main.nation_parquet as n
    on c.c_nationkey = n.n_nationkey
    group by n.n_regionkey
'''

# Execute Query
result_gdf = pyblazing.run_query_filesystem(sql, sql_data)

print(sql)
print(result_gdf)