- table of contents {
.toc}
0x00. Desciption
execute sql queies in SQLite in IMDB database.
0x01. Evn setup
I’m using Ubuntu20.04 VM for this homework. The VM is installed on VMware Workstation Pro 15. There’re several steps to prepare:
install SQLite,
sudo apt-get install sqlite3 libsqlite3-devDownload dataset,
wget https://15445.courses.cs.cmu.edu/fall2019/files/imdb-cmudb2019.db.gzverify dataset,
md5 imdb-cmudb2019.db.gzUnzip and execute it,
gunzip imdb-cmudb2019.db.gzandsqlite3 imdb-cmudb2019.db
0x02. Q1_SAMPLE
Sample, skip.
0x03. Q2_UNCOMMON_TYPE
List the longest title of each type along with the runtime minutes.
select type, primary_title, max(runtime_minutes) FROM titles GROUP BY type ORDER BY type ASC, primary_title ASC;output:
movie|Logistics|51420short|Kuriocity|461tvEpisode|Téléthon 2012|1800tvMiniSeries|Kôya no yôjinbô|1755tvMovie|ArtQuench Presents Spirit Art|2112tvSeries|The Sharing Circle|8400tvShort|Paul McCartney Backstage at Super Bowl XXXIX|60tvSpecial|Katy Perry Live: Witness World Wide|5760video|Midnight Movie Madness: 50 Movie Mega Pack|5135videoGame|Flushy Fish VR: Just Squidding Around|1500This is almost correct, however, in tvShort type, a tie exeist.
sqlite> select title_id,type, runtime_minutes FROM titles WHERE type = 'tvShort' ORDER BY runtime_minutes DESC LIMIT 5;tt2292857|tvShort|60tt5613498|tvShort|60tt10622020|tvShort|53tt0353158|tvShort|49tt5452108|tvShort|48Now the problem is the tie, I need to figure out a way to handle it(show both tuples when tie). So, I take the output of this sql as a temp table.
select type, primary_title, max(runtime_minutes) FROM titles GROUP BY type ORDER BY type ASC, primary_title ASC;And then, join it with titles table, on the same runtime_minutes and type, which means it will give me:
if max without tie, the join will output the same tuple,if max WITH tie, the join will output the tuples that has the same attributes* like the max, in this case, has the same runtime and type, which indicates it's a tie.As for the answer:
select titles.type, titles.primary_title, titles.runtime_minutes from titlesjoin (select type, primary_title, max(runtime_minutes) as maxLength FROM titles GROUP BY type) as maxTypeon maxType.maxLength = titles.runtime_minutes and maxType.type = titles.typeorder by titles.type ASC, titles.primary_title ASC;movie|Logistics|51420short|Kuriocity|461tvEpisode|Téléthon 2012|1800tvMiniSeries|Kôya no yôjinbô|1755tvMovie|ArtQuench Presents Spirit Art|2112tvSeries|The Sharing Circle|8400tvShort|Paul McCartney Backstage at Super Bowl XXXIX|60tvShort|The People Next Door|60tvSpecial|Katy Perry Live: Witness World Wide|5760video|Midnight Movie Madness: 50 Movie Mega Pack|5135videoGame|Flushy Fish VR: Just Squidding Around|15000x04. Q3_TV_VS_MOVIE
List all types of titles along with the number of associated titles.
SELECT type, count(distinct title_id) as number from titles GROUP BY type ORDER BY number;tvShort|4075videoGame|9044tvSpecial|9107tvMiniSeries|10291tvMovie|45431tvSeries|63631video|90069movie|197957short|262038tvEpisode|1603076Answer from prof:
SELECT type, count(*) AS title_count FROM titles GROUP BY type ORDER BY title_count ASC;So, at this time that I figured title_id is primary key and it’s unique, so distinct is redundant. It could be replaced by count(title_id) or simplily count(*).
0x05. Q4_OLD_IS_NOT_GOLD
Which decades saw the most number of titles getting premiered? List the number of titles in every decade. Like 2010s|2789741.
From what I could do, I got this:
SELECT CAST(premiered/10 AS INT)*10 as upTime, count(*) FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY upTime DESC;2020|24922010|10507322000|4946391990|2114531980|1192581970|997071960|752371950|395541940|100111930|114921920|131531910|265961900|95861890|22861880|221870|1I have no idea how to append ‘s’ at the end of each decade. So, I look for the answer.
SELECT CAST(premiered/10*10 AS TEXT) || 's' AS decade, COUNT(*) AS num_movies FROM titles WHERE premiered is not null GROUP BY decade ORDER BY num_movies DESC ;2010s|10507322000s|4946391990s|2114531980s|1192581970s|997071960s|752371950s|395541910s|265961920s|131531930s|114921940s|100111900s|95862020s|24921890s|22861880s|221870s|1I cast the decades into INT instead of TEXT which could not use || to append someting else. And sort this by number of movies(I sorted by the wrong collum before) Now I change my sql to this:
SELECT CAST(premiered/10*10 AS TEXT) || 's' as upTime, count(*) as numbers FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY numbers DESC;2010s|10507322000s|4946391990s|2114531980s|1192581970s|997071960s|752371950s|395541910s|265961920s|131531930s|114921940s|100111900s|95862020s|24921890s|22861880s|221870s|1And now, the output is exactly the same.
0x06. Q5_PERCENTAGE
List the decades and the percentage of titles which premiered in the corresponding decade. Display like : 2010s|45.7042.
SELECT CAST(premiered/10*10 AS TEXT) || 's' as upTime, ROUND(CAST(count(*) AS FLOAD) / (SELECT COUNT(*) FROM titles) * 100, 4) as numbersFROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY numbers DESC;2010s|45.78912000s|21.55551990s|9.21481980s|5.19711970s|4.34511960s|3.27871950s|1.72371910s|1.1591920s|0.57321930s|0.50081940s|0.43631900s|0.41772020s|0.10861890s|0.09961880s|0.0011870s|0.0This is done based on the previous question, just cast the number to fload and divide. The answer cast it into real. They are both proximity data types, so in this question, there no difference.
If you need accurate data, numeric and decimal should be better.
