Introducing AresDB: Uber’s GPU-Powered Open Source, Real-time Analytics Engine

January 29, 2019 / Global

| | Dashboards | Decision Systems | Ad hoc Queries |
| --- | --- | --- | --- |
| Query Pattern | Well known | Well known | Arbitrary |
| Query QPS | High | High | Low |
| Query Latency | Low | Low | High |
| Dataset | Subset | Subset | All data |

Figure 1: Comparison of CPU and GPU single precision floating point performance through the years. Image taken from Nvidia’s CUDA C programming guide.
Figure 2: The AresDB single instance architecture features memory and disk stores, and meta stores.

| Data Type | Storage (in Bytes) | Details |
| --- | --- | --- |
| Bool | 1/8 | Boolean type data, stored as a single bit |
| Int8, Uint8 | 1 | Integer number types. Users can choose based on the cardinality of the field and the memory cost. |
| Int16, Uint16 | 2 | |
| Int32, Uint32 | 4 | |
| SmallEnum | 1 | Strings are automatically translated into enums. SmallEnum can hold string types with cardinality up to 256 |
| BigEnum | 2 | Similar to SmallEnum, but holds higher cardinality, up to 65535 |
| Float32 | 4 | Floating point number. We support Float32 and intend to add Float64 support as needed |
| UUID | 16 | |
| GeoPoint | 4 | Geographic points |
| GeoShape | Variable Length | Polygon or multi-polygons |

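The SmallEnum and BigEnum rows describe a form of dictionary encoding: each distinct string is replaced by a small integer code. The following is a minimal sketch of that idea, not AresDB's implementation. The `EnumDict` class and its method names are hypothetical, and the 256-value limit is taken from the SmallEnum row.

```python
class EnumDict:
    """Hypothetical dictionary encoder: maps strings to small integer codes."""

    def __init__(self, max_cardinality=256):
        # 256 distinct codes fit in the 1-byte SmallEnum described in the table.
        self.max_cardinality = max_cardinality
        self.codes = {}   # string -> code
        self.values = []  # code -> string

    def encode(self, value):
        """Return the code for value, assigning a new one on first sight."""
        code = self.codes.get(value)
        if code is None:
            if len(self.values) >= self.max_cardinality:
                raise OverflowError("enum cardinality exceeded")
            code = len(self.values)
            self.codes[value] = code
            self.values.append(value)
        return code

    def decode(self, code):
        """Return the original string for a code."""
        return self.values[code]


status = EnumDict()
encoded = [status.encode(s) for s in ["completed", "rejected", "completed"]]
```

Codes are assigned in order of first appearance, so a column that exceeds the configured cardinality fails loudly instead of silently wrapping around.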
Figure 3: We use a primary key value to locate the batch and position within the batch for each record.
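As a rough illustration of the lookup in Figure 3, the sketch below keeps a hash table from primary key to a (batch, position) pair and uses it to deduplicate upserts. The `LiveStore` class, its fields, and the row-per-slot layout are assumptions made for the example; the real store is columnar.

```python
class LiveStore:
    """Hypothetical store: a primary key index over fixed-size batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.batches = [[]]  # each batch is a list of rows
        self.index = {}      # primary key -> (batch_id, position)

    def upsert(self, key, row):
        """Overwrite the row if the key exists, otherwise append it."""
        location = self.index.get(key)
        if location is not None:
            batch_id, position = location
            self.batches[batch_id][position] = row
            return location
        if len(self.batches[-1]) >= self.batch_size:
            self.batches.append([])  # current batch is full, start a new one
        batch_id = len(self.batches) - 1
        position = len(self.batches[batch_id])
        self.batches[batch_id].append(row)
        self.index[key] = (batch_id, position)
        return batch_id, position
```

A second upsert with the same key returns the original location and replaces the row in place, which is what makes the index usable for deduplication.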
Figure 4: We store values (actual value) and null vectors (validity) for uncompressed columns in the data table.
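The value and null (validity) vectors in Figure 4 can be pictured as two parallel arrays. This is a simplified sketch: the function names and the placeholder value stored for nulls are assumptions.

```python
def to_column(records, placeholder=0):
    """Split a list with None entries into a value vector and a validity vector."""
    values = [placeholder if r is None else r for r in records]
    validity = [r is not None for r in records]
    return values, validity


def column_sum(values, validity):
    """Sum only the entries whose validity flag is set."""
    return sum(v for v, valid in zip(values, validity) if valid)


fares, fare_valid = to_column([8.5, None, 10.75])
```

Keeping validity separate means the value vector stays densely packed with fixed-width entries, while aggregations consult the flags to skip nulls.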
Figure 5: We sort all rows by city_id, followed by status, then compress each column using run-length encoding. Each column will have a count vector after being sorted and compressed.
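To make Figure 5 concrete, here is a small sketch that sorts rows by city_id and then status, and run-length encodes each column into a value vector and a count vector. The sample rows and the `rle` helper are made up for illustration, and the sketch stores plain run lengths, which may differ from the on-disk representation.

```python
def rle(column):
    """Run-length encode a list into (values, counts)."""
    values, counts = [], []
    for item in column:
        if values and values[-1] == item:
            counts[-1] += 1  # extend the current run
        else:
            values.append(item)
            counts.append(1)  # start a new run
    return values, counts


# (city_id, status) pairs in arrival order.
rows = [(2, "completed"), (1, "rejected"), (1, "completed"),
        (2, "completed"), (1, "completed")]
rows.sort(key=lambda r: (r[0], r[1]))  # city_id first, then status

city_values, city_counts = rle([r[0] for r in rows])
status_values, status_counts = rle([r[1] for r in rows])
```

Sorting first is what makes the encoding effective: the leading sort column collapses into one run per distinct value, and later columns form runs within each of those groups.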
Figure 6: During ingestion, after the upsert batch is appended to the redo log, “late” records will be appended to a backfill queue while other records will be applied to the live store.
Figure 7: We use event time and cut-off times to determine which records are new (live) and old (with an event time older than the archiving cut-off).
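The routing described in Figures 6 and 7 can be sketched as follows. This assumes each record carries an `event_time` field and that plain lists stand in for the redo log, live store, and backfill queue; all of these names are hypothetical.

```python
def ingest(upsert_batch, redo_log, live_store, backfill_queue, archiving_cutoff):
    """Log the batch first, then route each record by its event time."""
    redo_log.append(list(upsert_batch))  # durability before applying anything
    for record in upsert_batch:
        if record["event_time"] < archiving_cutoff:
            backfill_queue.append(record)  # "late": older than the cut-off
        else:
            live_store.append(record)      # new enough for the live store


redo_log, live_store, backfill_queue = [], [], []
batch = [{"id": 1, "event_time": 100},
         {"id": 2, "event_time": 200},
         {"id": 3, "event_time": 300}]
ingest(batch, redo_log, live_store, backfill_queue, archiving_cutoff=200)
```

Only records strictly older than the cut-off are treated as late here, matching the wording of the Figure 7 caption.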
Figure 8: AresDB’s query execution flow leverages our homegrown AQL query language for fast, efficient data processing and retrieval.
Figure 9: AresDB pre-filters columnar data before sending it to the GPU for processing.
Figure 10: With AresDB, two CUDA streams alternate on data transfer and processing.
Figure 11: AresDB leverages the OOPK model for expression evaluation.
Figure 12: After expression evaluation, AresDB sorts and reduces data by key value on the dimension (key value) and measure (value) vectors.
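The sort-and-reduce step in Figure 12 has a simple sequential analogue, sketched here on the CPU. The function name is hypothetical and the sketch only handles a sum aggregation; on the GPU the same pattern would be carried out by parallel sort and reduce primitives.

```python
from itertools import groupby


def sort_and_reduce(dimensions, measures):
    """Sort (key, value) pairs by key, then sum the values of equal keys."""
    pairs = sorted(zip(dimensions, measures), key=lambda p: p[0])
    out_keys, out_values = [], []
    for key, group in groupby(pairs, key=lambda p: p[0]):
        out_keys.append(key)
        out_values.append(sum(value for _, value in group))
    return out_keys, out_values


keys, sums = sort_and_reduce([3600, 0, 3600, 0], [1.0, 2.0, 3.0, 4.0])
```

After sorting, equal keys are adjacent, so a single pass is enough to collapse each run of identical dimension values into one aggregated measure.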

| | Allocation | Management Mode |
| --- | --- | --- |
| Live Store Vectors (live store columnar data) | C | Tracked |
| Archive Store Vectors (archive store columnar data) | C | Managed (Load and eviction) |
| Primary Key Index (hash table for record deduplication) | C | Tracked |
| Backfill Queue (store “late” arrival data waiting for backfill) | Golang | Tracked |
| Archive / Backfill Process Temporary Storage (Temporary memory allocated during the Archive and Backfill process) | C | Tracked |
| | Golang and C | Statically Configured Estimate |

Figure 13: AresDB manages its own memory usage so that it does not exceed the configured total process budget.
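One way to picture the budget in Figure 13 is to treat the managed (archive) vectors as a cache that must fit in whatever remains once tracked memory and the static estimate are subtracted from the total. The sketch below rests on that assumption. The `ManagedMemory` class, the subtraction formula, and the least-recently-used eviction order are all illustrative choices, not a description of AresDB's actual policy.

```python
from collections import OrderedDict


class ManagedMemory:
    """Hypothetical accounting for loadable and evictable archive vectors."""

    def __init__(self, total_budget, static_estimate):
        self.total_budget = total_budget
        self.static_estimate = static_estimate
        self.tracked = 0             # bytes used by tracked allocations
        self.loaded = OrderedDict()  # vector name -> size, oldest use first

    def managed_budget(self):
        """Bytes left for managed vectors (assumed formula)."""
        return self.total_budget - self.static_estimate - self.tracked

    def used(self):
        return sum(self.loaded.values())

    def load(self, name, size):
        """Load a vector, evicting least recently used ones to make room."""
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as recently used
            return []
        if size > self.managed_budget():
            raise MemoryError("vector does not fit in the managed budget")
        evicted = []
        while self.used() + size > self.managed_budget():
            victim, _ = self.loaded.popitem(last=False)
            evicted.append(victim)
        self.loaded[name] = size
        return evicted
```

Because the oversized case is rejected before any eviction starts, a failed load never discards vectors that are already resident.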
Figure 14: The Uber Summary Dashboard’s hourly view uses AresDB to view real-time data analytics during specific time periods.
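
Trips:

| trip_id | request_at | city_id | status | driver_id | fare |
| --- | --- | --- | --- | --- | --- |
| 1 | 1542058870 | 1 | completed | 2 | 8.5 |
| 2 | 1541977200 | 1 | rejected | 3 | 10.75 |

Cities:

| city_id | city_name | timezone |
| --- | --- | --- |
| 1 | San Francisco | America/Los_Angeles |
| 2 | New York | America/New_York |
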
Trips:
{
  "name": "trips",
  "columns": [
    {
      "name": "request_at",
      "type": "Uint32"
    },
    {
      "name": "trip_id",
      "type": "UUID"
    },
    {
      "name": "city_id",
      "type": "Uint16"
    },
    {
      "name": "status",
      "type": "SmallEnum"
    },
    {
      "name": "driver_id",
      "type": "UUID"
    },
    {
      "name": "fare",
      "type": "Float32"
    }
  ],
  "primaryKeyColumns": [
    1
  ],
  "isFactTable": true,
  "config": {
    "batchSize": 2097152,
    "archivingDelayMinutes": 1440,
    "archivingIntervalMinutes": 180,
    "recordRetentionInDays": 30
  },
  "archivingSortColumns": [2, 3]
}
Cities:
{
  "name": "cities",
  "columns": [
    {
      "name": "city_id",
      "type": "Uint16"
    },
    {
      "name": "city_name",
      "type": "SmallEnum"
    },
    {
      "name": "timezone",
      "type": "SmallEnum"
    }
  ],
  "primaryKeyColumns": [
    0
  ],
  "isFactTable": false,
  "config": {
    "batchSize": 2097152
  }
}
Total trip fares in San Francisco in the last 24 hours, grouped by hour:
{
  "table": "trips",
  "joins": [
    {
      "alias": "cities",
      "name": "cities",
      "conditions": [
        "cities.city_id = trips.city_id"
      ]
    }
  ],
  "dimensions": [
    {
      "sqlExpression": "request_at",
      "timeBucketizer": "hour"
    }
  ],
  "measures": [
    {
      "sqlExpression": "sum(fare)"
    }
  ],
  "rowFilters": [
    "status = 'completed'",
    "cities.city_name = 'San Francisco'"
  ],
  "timeFilter": {
    "column": "request_at",
    "from": "24 hours ago"
  },
  "timezone": "America/Los_Angeles"
}
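To show what this kind of query computes, the sketch below evaluates the same logic in plain Python over the two sample trips and two sample cities shown earlier: join on city, filter by status and city name, bucket `request_at` by hour, and sum the fares. The function name and the `since` parameter are hypothetical. The sketch truncates timestamps to UTC hour boundaries, which matches local hour boundaries only for time zones with a whole-hour offset such as America/Los_Angeles.

```python
def total_fare_by_hour(trips, cities, city_name, status, since=0):
    """Sum fares per hour bucket for trips matching the city and status."""
    city_ids = {c["city_id"] for c in cities if c["city_name"] == city_name}
    totals = {}
    for trip in trips:
        if trip["request_at"] < since:
            continue  # outside the time filter
        if trip["status"] != status or trip["city_id"] not in city_ids:
            continue  # fails a row filter or the join condition
        bucket = trip["request_at"] - trip["request_at"] % 3600
        totals[bucket] = totals.get(bucket, 0.0) + trip["fare"]
    return totals


trips = [
    {"trip_id": 1, "request_at": 1542058870, "city_id": 1,
     "status": "completed", "driver_id": 2, "fare": 8.5},
    {"trip_id": 2, "request_at": 1541977200, "city_id": 1,
     "status": "rejected", "driver_id": 3, "fare": 10.75},
]
cities = [
    {"city_id": 1, "city_name": "San Francisco", "timezone": "America/Los_Angeles"},
    {"city_id": 2, "city_name": "New York", "timezone": "America/New_York"},
]
result = total_fare_by_hour(trips, cities, "San Francisco", "completed")
```

Only the completed trip survives the filters, so the result holds a single hour bucket keyed by the start of that hour as a Unix timestamp.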
Active drivers in San Francisco in the last 24 hours, grouped by hour:
{
  "table": "trips",
  "joins": [
    {
      "alias": "cities",
      "name": "cities",
      "conditions": [
        "cities.city_id = trips.city_id"
      ]
    }
  ],
  "dimensions": [
    {
      "sqlExpression": "request_at",
      "timeBucketizer": "hour"
    }
  ],
  "measures": [
    {
      "sqlExpression": "countDistinctHLL(driver_id)"
    }
  ],
  "rowFilters": [
    "status = 'completed'",
    "cities.city_name = 'San Francisco'"
  ],
  "timeFilter": {
    "column": "request_at",
    "from": "24 hours ago"
  },
  "timezone": "America/Los_Angeles"
}
Total trip fares in San Francisco in the last 24 hours, grouped by hour:
{
  "results": [
    {
      "1547060400": 1000.0,
      "1547064000": 1000.0,
      "1547067600": 1000.0,
      "1547071200": 1000.0,
      "1547074800": 1000.0,
      …
    }
  ]
}
Active drivers in San Francisco in the last 24 hours, grouped by hour:
{
  "results": [
    {
      "1547060400": 100,
      "1547064000": 100,
      "1547067600": 100,
      "1547071200": 100,
      "1547074800": 100,
      …
    }
  ]
}
Jian Shen

Jian Shen is a senior software engineer on Uber’s Streaming Data team, working on real-time analytics.

Ze Wang

Ze Wang is a senior software engineer on Uber's Real-time Analytics team.

David Wang

David Wang is an Engineering Manager for Uber's Signup and Login team.

Jeremy Shi

Jeremy Shi is a senior software engineer on Uber's Real-time Analytics team.

Steven Chen

Steven Chen is an engineering manager on Uber's Big Data team.

Posted by Jian Shen, Ze Wang, David Wang, Jeremy Shi, Steven Chen