CMU 15-445/645 - Homework assignment #1 SQL

table of contents {.toc}

0x00. Desciption#

execute sql queies in SQLite in IMDB database.

0x01. Evn setup#

I’m using Ubuntu20.04 VM for this homework. The VM is installed on VMware Workstation Pro 15. There’re several steps to prepare:

install SQLite, sudo apt-get install sqlite3 libsqlite3-dev
Download dataset, wget https://15445.courses.cs.cmu.edu/fall2019/files/imdb-cmudb2019.db.gz
verify dataset, md5 imdb-cmudb2019.db.gz
Unzip and execute it, gunzip imdb-cmudb2019.db.gz and sqlite3 imdb-cmudb2019.db

0x02. Q1_SAMPLE#

Sample, skip.

0x03. Q2_UNCOMMON_TYPE#

List the longest title of each type along with the runtime minutes.

1
select type, primary_title, max(runtime_minutes) FROM titles GROUP BY type ORDER BY type ASC, primary_title ASC;

output:

1
movie|Logistics|51420
2
short|Kuriocity|461
3
tvEpisode|Téléthon 2012|1800
4
tvMiniSeries|Kôya no yôjinbô|1755
5
tvMovie|ArtQuench Presents Spirit Art|2112
6
tvSeries|The Sharing Circle|8400
7
tvShort|Paul McCartney Backstage at Super Bowl XXXIX|60
8
tvSpecial|Katy Perry Live: Witness World Wide|5760
9
video|Midnight Movie Madness: 50 Movie Mega Pack|5135
10
videoGame|Flushy Fish VR: Just Squidding Around|1500

This is almost correct, however, in tvShort type, a tie exeist.

1
sqlite> select title_id,type, runtime_minutes FROM titles WHERE type = 'tvShort'  ORDER BY runtime_minutes DESC LIMIT 5;
2
tt2292857|tvShort|60
3
tt5613498|tvShort|60
4
tt10622020|tvShort|53
5
tt0353158|tvShort|49
6
tt5452108|tvShort|48

Now the problem is the tie, I need to figure out a way to handle it(show both tuples when tie). So, I take the output of this sql as a temp table.

1
select type, primary_title, max(runtime_minutes) FROM titles GROUP BY type ORDER BY type ASC, primary_title ASC;

And then, join it with titles table, on the same runtime_minutes and type, which means it will give me:

1
if max without tie, the join will output the same tuple,
2
if max WITH tie, the join will output the tuples that has the same attributes* like the max, in this case, has the same runtime and type, which indicates it's a tie.

As for the answer:

1
select  titles.type, titles.primary_title, titles.runtime_minutes from titles
2
join (select type, primary_title, max(runtime_minutes) as maxLength FROM titles GROUP BY type) as maxType
3
on maxType.maxLength = titles.runtime_minutes and maxType.type = titles.type
4
order by titles.type ASC, titles.primary_title ASC;

1
movie|Logistics|51420
2
short|Kuriocity|461
3
tvEpisode|Téléthon 2012|1800
4
tvMiniSeries|Kôya no yôjinbô|1755
5
tvMovie|ArtQuench Presents Spirit Art|2112
6
tvSeries|The Sharing Circle|8400
7
tvShort|Paul McCartney Backstage at Super Bowl XXXIX|60
8
tvShort|The People Next Door|60
9
tvSpecial|Katy Perry Live: Witness World Wide|5760
10
video|Midnight Movie Madness: 50 Movie Mega Pack|5135
11
videoGame|Flushy Fish VR: Just Squidding Around|1500

0x04. Q3_TV_VS_MOVIE#

List all types of titles along with the number of associated titles.

1
SELECT type, count(distinct title_id) as number from titles GROUP BY type ORDER BY number;

1
tvShort|4075
2
videoGame|9044
3
tvSpecial|9107
4
tvMiniSeries|10291
5
tvMovie|45431
6
tvSeries|63631
7
video|90069
8
movie|197957
9
short|262038
10
tvEpisode|1603076

Answer from prof:

1
SELECT type, count(*) AS title_count FROM titles GROUP BY type ORDER BY title_count ASC;

So, at this time that I figured title_id is primary key and it’s unique, so distinct is redundant. It could be replaced by count(title_id) or simplily count(*).

0x05. Q4_OLD_IS_NOT_GOLD#

Which decades saw the most number of titles getting premiered? List the number of titles in every decade. Like 2010s|2789741.

From what I could do, I got this:

1
SELECT CAST(premiered/10 AS INT)*10 as upTime, count(*) FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY upTime DESC;

1
2020|2492
2
2010|1050732
3
2000|494639
4
1990|211453
5
1980|119258
6
1970|99707
7
1960|75237
8
1950|39554
9
1940|10011
10
1930|11492
11
1920|13153
12
1910|26596
13
1900|9586
14
1890|2286
15
1880|22
16
1870|1

I have no idea how to append ‘s’ at the end of each decade. So, I look for the answer.

1
SELECT
2
  CAST(premiered/10*10 AS TEXT) || 's' AS decade,
3
  COUNT(*) AS num_movies
4
  FROM titles
5
  WHERE premiered is not null
6
  GROUP BY decade
7
  ORDER BY num_movies DESC
8
  ;

1
2010s|1050732
2
2000s|494639
3
1990s|211453
4
1980s|119258
5
1970s|99707
6
1960s|75237
7
1950s|39554
8
1910s|26596
9
1920s|13153
10
1930s|11492
11
1940s|10011
12
1900s|9586
13
2020s|2492
14
1890s|2286
15
1880s|22
16
1870s|1

I cast the decades into INT instead of TEXT which could not use || to append someting else. And sort this by number of movies(I sorted by the wrong collum before) Now I change my sql to this:

1
SELECT CAST(premiered/10*10 AS TEXT) || 's' as upTime, count(*) as numbers FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY numbers DESC;

1
2010s|1050732
2
2000s|494639
3
1990s|211453
4
1980s|119258
5
1970s|99707
6
1960s|75237
7
1950s|39554
8
1910s|26596
9
1920s|13153
10
1930s|11492
11
1940s|10011
12
1900s|9586
13
2020s|2492
14
1890s|2286
15
1880s|22
16
1870s|1

And now, the output is exactly the same.

0x06. Q5_PERCENTAGE#

List the decades and the percentage of titles which premiered in the corresponding decade. Display like : 2010s|45.7042.

1
SELECT CAST(premiered/10*10 AS TEXT) || 's' as upTime, ROUND(CAST(count(*) AS FLOAD) / (SELECT COUNT(*) FROM titles) * 100, 4) as numbers
2
FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY numbers DESC;

1
2010s|45.7891
2
2000s|21.5555
3
1990s|9.2148
4
1980s|5.1971
5
1970s|4.3451
6
1960s|3.2787
7
1950s|1.7237
8
1910s|1.159
9
1920s|0.5732
10
1930s|0.5008
11
1940s|0.4363
12
1900s|0.4177
13
2020s|0.1086
14
1890s|0.0996
15
1880s|0.001
16
1870s|0.0

This is done based on the previous question, just cast the number to fload and divide. The answer cast it into real. They are both proximity data types, so in this question, there no difference.

If you need accurate data, numeric and decimal should be better.