BlazingSQL Documentation

Welcome to our Documentation and Support Page!

BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS data science framework. RAPIDS is a collection of open-source libraries for running end-to-end data science pipelines entirely on GPUs. BlazingSQL extends RAPIDS and enables users to run SQL queries on Apache Arrow data in GPU memory.

Please install, test, deploy, and gripe on our Discussions board.


2 Minutes to BlazingSQL

Inspired by 10 Minutes to pandas, but we assume you know SQL, so there isn't all that much to teach. This is a short introduction to BlazingSQL in the RAPIDS ecosystem.

Import packages

For now, you need to import both cuDF and pyblazing.

import cudf
import pyblazing

Querying Data

See Data Definition Language (DDL) for more details.

BlazingSQL is a query engine for GPU DataFrames (GDFs). These DataFrames are part of the RAPIDS ecosystem and are built on Apache Arrow in GPU memory.

BlazingSQL V0.2 can query data in-memory or stored in files. The currently supported formats are listed below, followed by a short in-memory example:

  • In-Memory
    • GPU DataFrame (GDF)
    • Pandas
    • Apache Arrow
  • Files
    • CSV
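
For example, an in-memory pandas DataFrame can be registered under a table name and queried with the same pattern. This is a minimal sketch: the tiny DataFrame is made up for illustration, and it assumes pyblazing's run_query_pandas takes the SQL string and a dictionary mapping table names to pandas DataFrames, mirroring the GPU DataFrame example below.

import pandas as pd
import pyblazing

# A small, made-up pandas DataFrame held in host memory
nation_df = pd.DataFrame({'n_nationkey': [0, 1, 2],
                          'n_regionkey': [0, 1, 1]})

# Register the DataFrame under a table name and query it
tables = {'nation': nation_df}
sql = 'select n_nationkey, n_regionkey from main.nation'
result = pyblazing.run_query_pandas(sql, tables)
print(result)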

For this tutorial, we will show you how to query a GPU DataFrame.

Query a GPU DataFrame

# Define Column Names and Column Data Types
column_names = ['n_nationkey', 'n_name', 'n_regionkey', 'n_comment']
column_types = ['int32', 'int64', 'int32', 'int64']

# Create GPU DataFrame from CSV File (similar to Pandas)
nation_gdf = cudf.read_csv("../data/nation.csv", delimiter='|',
                           dtype=column_types, names=column_names)

# Create Tables Dictionary
tables = {'nation': nation_gdf}

# SQL Query
sql = 'select n_nationkey, n_regionkey, n_nationkey + n_regionkey as addition from main.nation'

# Execute Query
result_gdf = pyblazing.run_query(sql, tables)

print(sql)
print(result_gdf)

Query a CSV File

import pyblazing
from pyblazing import SchemaFrom

# Register Filesystem (see the File System section for registering an HDFS connection)
register_hdfs()

# Define Column Names and Column Data Types
names = ['n_nationkey', 'n_name', 'n_regionkey', 'n_comment']
dtypes = ['int32', 'int64', 'int32', 'int64']

# Create Table Reference
nation_schema = pyblazing.register_table_schema(
    table_name='nation',
    type=SchemaFrom.CsvFile,
    path='hdfs://tpch_hdfs/Data1Mb/nation_0_0.csv',
    delimiter='|',
    dtypes=dtypes,
    names=names)
table_data = {
  nation_schema: ['hdfs://tpch_hdfs/Data1Mb/nation_0_0.csv']
}

# SQL Query
sql = 'select n_nationkey, n_regionkey + n_nationkey as addition from main.nation'

# Execute Query
result_gdf = pyblazing.run_query_filesystem(sql, table_data)

print(sql)
print(result_gdf)
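
Additional file-backed tables can be registered the same way and combined in a single query. The sketch below assumes a second TPC-H file, region_0_0.csv, at a hypothetical path on the same HDFS mount; the register_table_schema and run_query_filesystem calls simply mirror the nation example above.

# Hypothetical second table: the TPC-H region file on the same HDFS mount
region_names = ['r_regionkey', 'r_name', 'r_comment']
region_dtypes = ['int32', 'int64', 'int64']
region_schema = pyblazing.register_table_schema(
    table_name='region',
    type=SchemaFrom.CsvFile,
    path='hdfs://tpch_hdfs/Data1Mb/region_0_0.csv',
    delimiter='|',
    dtypes=region_dtypes,
    names=region_names)

# Both registered tables can be referenced in one query
table_data = {
  nation_schema: ['hdfs://tpch_hdfs/Data1Mb/nation_0_0.csv'],
  region_schema: ['hdfs://tpch_hdfs/Data1Mb/region_0_0.csv']
}

# Join the two tables on their region key
sql = ('select n.n_nationkey, r.r_regionkey '
       'from main.nation as n '
       'join main.region as r on n.n_regionkey = r.r_regionkey')
result_gdf = pyblazing.run_query_filesystem(sql, table_data)
print(result_gdf)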

SQL

See Data Manipulation Language (DML) for more details.

We support a fair amount of SQL and are rapidly expanding coverage. Queries are parsed and planned with Apache Calcite; a SQL feature is supported whenever our GPU primitives are sufficient to implement the query plan Calcite delivers.

Here is a quick list of supported SQL, with a short example after the list:

  • SELECT
  • WHERE
  • GROUP BY
  • JOIN
  • ORDER BY
  • UNARY FUNCTIONS
  • BINARY FUNCTIONS
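
As a quick illustration, several of these can appear in one statement. A sketch against the nation table from the GPU DataFrame example above, assuming the COUNT aggregate is available alongside GROUP BY:

# WHERE, GROUP BY, and ORDER BY combined in one query (sketch)
sql = ('select n_regionkey, count(n_nationkey) as nations '
       'from main.nation '
       'where n_nationkey > 0 '
       'group by n_regionkey '
       'order by n_regionkey')
result_gdf = pyblazing.run_query(sql, tables)
print(result_gdf)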

Thanks! Hope that helped, and let us know if it really didn't!


What's Next

Learn the core features of BlazingSQL.

File System
Data Definition Language (DDL)
Data Manipulation Language (DML)