APPROX_COUNT_DISTINCT
The approximate number of distinct values in the time window.
Use this function instead of COUNT_DISTINCT
to improve performance. This should be used when there are relatively few (under 1 million) rows, and where the total number of distinct values is high.
Syntax
APPROX_COUNT_DISTINCT(VALUE)
Arguments
VALUE:
Can be of any type
Returns
An integer.
Notes
Duplicate rows are not counted. NULLs are not counted.
Example
Data
[
{
"GROUP_ID":"G1",
"USER_ID":"U1",
"CONNECTION_TIME":"2020-06-26 02:31:29,573"
},
{
"GROUP_ID":"G1",
"USER_ID":"U2",
"CONNECTION_TIME":"2020-06-26 18:11:45,783"
},
{
"GROUP_ID":"G2",
"USER_ID":"Z1",
"CONNECTION_TIME":"2020-06-26 23:54:27,687"
},
{
"GROUP_ID":"G2",
"CONNECTION_TIME":"2020-07-26 23:54:27,687"
},
{
"GROUP_ID":"G1",
"USER_ID":"U2",
"CONNECTION_TIME":"2021-07-01 02:31:29,573"
},
{
"GROUP_ID":"G1",
"USER_ID":"U1",
"CONNECTION_TIME":"2021-07-01 18:11:45,783"
},
{
"GROUP_ID":"G2",
"USER_ID":"Z1",
"CONNECTION_TIME":"2021-07-01 23:54:27,687"
}
]
Query
Count the number of distinct entries for each unique entry in a field:
SET
DATE_TIME = TO_DATE(data.CONNECTION_TIME);
SELECT
APPROX_COUNT_DISTINCT(data.USER_ID)
AS approx_count_distinct_data_user_id:BIGINT,
data.GROUP_ID AS group_id:STRING
FROM
"SAMPLE_DATA_G1U1 - json"
GROUP BY
data.GROUP_ID
Results
{
"count_distinct_data_user_id":1,
"group_id":"G2"
}{
"count_distinct_data_user_id":2,
"group_id":"G1"
}
Related Functions
COUNT_DISTINCT APPROX_COUNT_DISTINCT_EACH
Dialog

Last updated
Was this helpful?