PostgreSQL: Correct way for subset calculations












1















I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).










share|improve this question









New contributor




Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





















  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    10 hours ago











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    9 hours ago











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    8 hours ago
















1















I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).










share|improve this question









New contributor




Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





















  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    10 hours ago











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    9 hours ago











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    8 hours ago














1












1








1








I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).










share|improve this question









New contributor




Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.












I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this



user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2


I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.



From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).







postgresql






share|improve this question









New contributor




Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question









New contributor




Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question








edited 8 hours ago









Evan Carroll

31.5k865210




31.5k865210






New contributor




Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 12 hours ago









XimikXimik

1063




1063




New contributor




Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Ximik is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.













  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    10 hours ago











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    9 hours ago











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    8 hours ago



















  • Please post your table(s) definition(s), query, and execution plan.

    – Colin 't Hart
    10 hours ago











  • @a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

    – Ximik
    9 hours ago











  • When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

    – Evan Carroll
    8 hours ago

















Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
10 hours ago





Please post your table(s) definition(s), query, and execution plan.

– Colin 't Hart
10 hours ago













@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
9 hours ago





@a_horse_with_no_name activity_id=4 changes nothing in such a case. It's 10.000.000 rows.

– Ximik
9 hours ago













When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
8 hours ago





When you say activity id = 1, 2 do you mean 1 AND 2, or 1 OR 2?

– Evan Carroll
8 hours ago










3 Answers
3






active

oldest

votes


















2














You can easily do this with arrays in Postgres:



select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));




The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



Online example: https://rextester.com/YLN7221






share|improve this answer

































    1














    You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



    CREATE INDEX elbat_user_id_activity_id
    ON elbat (user_id,
    activity_id);





    SELECT DISTINCT t1.user_id
    FROM elbat t1
    WHERE EXISTS (SELECT *
    FROM elbat t2
    WHERE t2.user_id = t1.user_id
    AND t2.activity_id = '1')
    AND EXISTS (SELECT *
    FROM elbat t2
    WHERE t2.user_id = t1.user_id
    AND t2.activity_id = '2')
    AND NOT EXISTS (SELECT *
    FROM elbat t2
    WHERE t2.user_id = t1.user_id
    AND t2.activity_id = '3');


    If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






    share|improve this answer































      0














      You can use bool_or for this.



      SELECT user_id
      FROM users
      GROUP BY user_id
      HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
      AND NOT bool_or(activity_id IN (3))





      share|improve this answer























        Your Answer








        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "182"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: false,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: null,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });






        Ximik is a new contributor. Be nice, and check out our Code of Conduct.










        draft saved

        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        3 Answers
        3






        active

        oldest

        votes








        3 Answers
        3






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        2














        You can easily do this with arrays in Postgres:



        select user_id, array_agg(activity_id) as activities
        from users
        group by user_id
        having array_agg(activity_id) @> array[1,2]
        and not 3 = any(array_agg(activity_id));




        The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



        If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



        On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



        Online example: https://rextester.com/YLN7221






        share|improve this answer






























          2














          You can easily do this with arrays in Postgres:



          select user_id, array_agg(activity_id) as activities
          from users
          group by user_id
          having array_agg(activity_id) @> array[1,2]
          and not 3 = any(array_agg(activity_id));




          The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



          If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



          On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



          Online example: https://rextester.com/YLN7221






          share|improve this answer




























            2












            2








            2







            You can easily do this with arrays in Postgres:



            select user_id, array_agg(activity_id) as activities
            from users
            group by user_id
            having array_agg(activity_id) @> array[1,2]
            and not 3 = any(array_agg(activity_id));




            The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



            If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



            On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



            Online example: https://rextester.com/YLN7221






            share|improve this answer















            You can easily do this with arrays in Postgres:



            select user_id, array_agg(activity_id) as activities
            from users
            group by user_id
            having array_agg(activity_id) @> array[1,2]
            and not 3 = any(array_agg(activity_id));




            The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3



            If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.



            On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.



            Online example: https://rextester.com/YLN7221







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited 11 hours ago

























            answered 11 hours ago









            a_horse_with_no_namea_horse_with_no_name

            38.9k775112




            38.9k775112

























                1














                You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



                CREATE INDEX elbat_user_id_activity_id
                ON elbat (user_id,
                activity_id);





                SELECT DISTINCT t1.user_id
                FROM elbat t1
                WHERE EXISTS (SELECT *
                FROM elbat t2
                WHERE t2.user_id = t1.user_id
                AND t2.activity_id = '1')
                AND EXISTS (SELECT *
                FROM elbat t2
                WHERE t2.user_id = t1.user_id
                AND t2.activity_id = '2')
                AND NOT EXISTS (SELECT *
                FROM elbat t2
                WHERE t2.user_id = t1.user_id
                AND t2.activity_id = '3');


                If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






                share|improve this answer




























                  1














                  You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



                  CREATE INDEX elbat_user_id_activity_id
                  ON elbat (user_id,
                  activity_id);





                  SELECT DISTINCT t1.user_id
                  FROM elbat t1
                  WHERE EXISTS (SELECT *
                  FROM elbat t2
                  WHERE t2.user_id = t1.user_id
                  AND t2.activity_id = '1')
                  AND EXISTS (SELECT *
                  FROM elbat t2
                  WHERE t2.user_id = t1.user_id
                  AND t2.activity_id = '2')
                  AND NOT EXISTS (SELECT *
                  FROM elbat t2
                  WHERE t2.user_id = t1.user_id
                  AND t2.activity_id = '3');


                  If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






                  share|improve this answer


























                    1












                    1








                    1







                    You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



                    CREATE INDEX elbat_user_id_activity_id
                    ON elbat (user_id,
                    activity_id);





                    SELECT DISTINCT t1.user_id
                    FROM elbat t1
                    WHERE EXISTS (SELECT *
                    FROM elbat t2
                    WHERE t2.user_id = t1.user_id
                    AND t2.activity_id = '1')
                    AND EXISTS (SELECT *
                    FROM elbat t2
                    WHERE t2.user_id = t1.user_id
                    AND t2.activity_id = '2')
                    AND NOT EXISTS (SELECT *
                    FROM elbat t2
                    WHERE t2.user_id = t1.user_id
                    AND t2.activity_id = '3');


                    If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.






                    share|improve this answer













                    You can first try to rewrite the query using EXISTS and a regular (B-tree) index on user_id and activity_id.



                    CREATE INDEX elbat_user_id_activity_id
                    ON elbat (user_id,
                    activity_id);





                    SELECT DISTINCT t1.user_id
                    FROM elbat t1
                    WHERE EXISTS (SELECT *
                    FROM elbat t2
                    WHERE t2.user_id = t1.user_id
                    AND t2.activity_id = '1')
                    AND EXISTS (SELECT *
                    FROM elbat t2
                    WHERE t2.user_id = t1.user_id
                    AND t2.activity_id = '2')
                    AND NOT EXISTS (SELECT *
                    FROM elbat t2
                    WHERE t2.user_id = t1.user_id
                    AND t2.activity_id = '3');


                    If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered 11 hours ago









                    sticky bitsticky bit

                    1,733314




                    1,733314























                        0














                        You can use bool_or for this.



                        SELECT user_id
                        FROM users
                        GROUP BY user_id
                        HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
                        AND NOT bool_or(activity_id IN (3))





                        share|improve this answer




























                          0














                          You can use bool_or for this.



                          SELECT user_id
                          FROM users
                          GROUP BY user_id
                          HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
                          AND NOT bool_or(activity_id IN (3))





                          share|improve this answer


























                            0












                            0








                            0







                            You can use bool_or for this.



                            SELECT user_id
                            FROM users
                            GROUP BY user_id
                            HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
                            AND NOT bool_or(activity_id IN (3))





                            share|improve this answer













                            You can use bool_or for this.



                            SELECT user_id
                            FROM users
                            GROUP BY user_id
                            HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
                            AND NOT bool_or(activity_id IN (3))






                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered 8 hours ago









                            Evan CarrollEvan Carroll

                            31.5k865210




                            31.5k865210






















                                Ximik is a new contributor. Be nice, and check out our Code of Conduct.










                                draft saved

                                draft discarded


















                                Ximik is a new contributor. Be nice, and check out our Code of Conduct.













                                Ximik is a new contributor. Be nice, and check out our Code of Conduct.












                                Ximik is a new contributor. Be nice, and check out our Code of Conduct.
















                                Thanks for contributing an answer to Database Administrators Stack Exchange!


                                • Please be sure to answer the question. Provide details and share your research!

                                But avoid



                                • Asking for help, clarification, or responding to other answers.

                                • Making statements based on opinion; back them up with references or personal experience.


                                To learn more, see our tips on writing great answers.




                                draft saved


                                draft discarded














                                StackExchange.ready(
                                function () {
                                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
                                }
                                );

                                Post as a guest















                                Required, but never shown





















































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown

































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown







                                Popular posts from this blog

                                Reichsarbeitsdienst

                                Tanganjiko

                                Norda sulo