SQL: need to outer join, but don't want to increase result size

Started by Disch, April 24, 2017, 08:32:04 PM

Previous topic - Next topic

Disch

Suppose I have two tables.

Table A:

user_id | date_start | date_end
=================================
1       | 2015-01-01 | 2015-01-20
1       | 2016-12-01 | 2016-12-20
2       | 2016-12-01 | 2016-12-20


Table B:

user_id | event_date | other_info
=================================
1       | 2016-11-05 | foo
1       | 2016-12-05 | bar
1       | 2016-12-15 | baz


I'm doing a query for a specific user on Table 1:


SELECT * FROM A WHERE user_id=1


Logically in this case, I'll get 2 results:


user_id | date_start | date_end
=================================
1       | 2015-01-01 | 2015-01-20
1       | 2016-12-01 | 2016-12-20


Now I need to tie the result of that query to the 'other_info' in table B, where 'event_date' falls between the date range in table A.

This alone is easy enough to do with a join -- but the catch is I always want the number of results to match what I would get for the base query.  I don't want the join to add/remove results.  So in this case, I'd want EXACTLY TWO RESULTS.  In the case that there are multiple matches, I only want the earliest date.


What I want is in the vein of this:

SELECT * FROM A
LEFT OUTER JOIN B ON (event_date >= date_start
                  AND event_date < date_end
                  AND A.user_id = B.user_id)


The problem is, that query will give me the below result:


user_id | date_start | date_end    | event_date | other_info
=============================================================
1       | 2015-01-01 | 2015-01-20  | null       | null
1       | 2016-12-01 | 2016-12-20  | 2016-12-05 | bar
1       | 2016-12-01 | 2016-12-20  | 2016-12-15 | baz


What I want is this:

user_id | date_start | date_end    | event_date | other_info
=============================================================
1       | 2015-01-01 | 2015-01-20  | null       | null
1       | 2016-12-01 | 2016-12-20  | 2016-12-05 | bar


(note it's important that I get 'bar' and not 'baz' -- since bar has the earliest event date within the given date range)


Is there a simple way to get this?

Any help appreciated.  I've been googling for an answer and can't find anything directly on point.  Thanks.


(EDIT:  changed the example to get 2 hits with the base query, as that does a better job illustrating what I need)


abw

You haven't mentioned anything about which RDBMS you're using, any applicable table constraints, or how you'd like to resolve e.g. duplicate rows in A or rows in B with the same (user_id, event_date) and different (other_info), but here's a query to consider (tested on Oracle 12c):

select user_id, date_start, date_end, event_date, other_info
from (
  select a.user_id, a.date_start, a.date_end, b.event_date, b.other_info, row_number() over (partition by a.user_id, a.date_start, a.date_end order by b.event_date) as rnum
  from a left outer join b on (event_date >= date_start and event_date < date_end and a.user_id = b.user_id)
) where rnum = 1;


The basic idea is to take the query you already had, number the rows based on your criteria, and then only select the rows with the right number. You can easily extend this to stabilize which row of B is selected by e.g.

... row_number() over (partition by a.user_id, a.date_start, a.date_end order by b.event_date, b.other_info nulls last) as rnum

instead, or if you wanted the first 2 rows for each (user_id, date_start, date_end) you could do

) where rnum < 3;


Avicalendriya


Disch

Thanks everyone.  I'm still pretty fuzzy on SQL and am doing this for a school/job thing.  I'm not going back "on the clock" until tomorrow afternoon so I won't be putting effort into this until then.  So mostly my response is "thanks, I'll try that out tomorrow".

---------------------
@assassin:  Thanks for the links!

--------------------------
@abw:  Whoops, sorry.  I'm working with PostgreSQL, but I have no idea what version.  All other relevant information is presented (no other conflicts are of concern).

As for your query, I'll have to futz around with it.  I'm extremely fuzzy on how partition works, but I'm sure I can piece it together.

--------------------------
@Avicalendriya:  If I LIMIT 1, won't that limit the results of the entire query?  Pretty sure that's not what I want (in the example in my OP -- I'd need 2 results), but thanks anyway.

abw

I'm not really familiar with PostgreSQL, but a quick check with Google suggests that row_number() should work the same in PostgreSQL as in Oracle (I only glossed over it, but http://www.postgresqltutorial.com/postgresql-row_number/ seems to give a decent tutorial on row_number() specifically; the same concepts apply to other partitioning/windowing functions). Since this is a learning exercise, I'll add that before being introduced to the glories of analytic SQL functions, I found myself writing a lot of queries like this:

select a.user_id, a.date_start, a.date_end, b.event_date, min(b.other_info)
from a left outer join b on (event_date >= date_start and event_date < date_end and a.user_id = b.user_id)
where b.event_date is null or (a.user_id, a.date_start, a.date_end, b.event_date) in (
  select a.user_id, a.date_start, a.date_end, min(b.event_date)
  from a left join b on event_date >= date_start and event_date < date_end and a.user_id = b.user_id
  group by a.user_id, a.date_start, a.date_end
)
group by a.user_id, a.date_start, a.date_end, b.event_date;

with the basic idea being to find the minimum event_date within each (user_id, date_start, date_end) and then re-query B to find the (other_info) that matches (user_id, event_date). It gets the job done, but performance becomes an issue once you start working with large data sets.

On the other hand, as Avicalendriya implies, if you know beforehand that you're only going to be looking for a particular user_id, you might want to write a query that avoids computing results for other user_ids in order to achieve additional performance benefits.

Disch

Excellent, abw, that tutorial page is exactly what I needed.

So it seems like I'll want to do a row_number on a partition by user_id, start_date, end_date, and order by event_date ... then in the query only take rows WHERE row_number=1.

I'm going to tinker with this for a bit and will report back.

Thanks for all the help, guys.

EDIT:

Got it!

This query appears to be working:


SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER(
PARTITION BY ( a.id, date_start, date_end)
ORDER BY event_date ASC
)
FROM a
LEFT OUTER JOIN b ON (
event_date >= date_start
AND event_date < date_end
AND a.id = b.id
)
WHERE
a.id=1
) x
WHERE row_number=1



Thanks again guys.   :beer: