Data Processing & Cleaning
Concatenating the scraped data files
As the scraping was not completed in one time (it took several hours!), I executed three runs of the code, resulting in 3 data files. These data files were concatenated on the command line on MacOS using the following code.
- Resulting files from scraping of u2gigs.com:
- u2analytics_data_shows1-234.csv
- u2analytics_data_shows235-298.csv
- u2analytics_data_shows299-2307.csv
- Now combine everything into one data file:
- First create new file with header:
head -n 1 u2analytics_data_shows1-234.csv > u2data_all_shows.csv
- Then concatenate content from 3 data files into the new file (skipping headers):
tail -n +2 -q u2analytics_data_shows*.csv >> u2data_all_shows.csv
- First create new file with header:
- Check number of lines in original 3 files and final data file:
wc -l u2analytics_data_shows*
- 2980 u2analytics_data_shows1-234.csv
- 1176 u2analytics_data_shows235-298.csv
- 35222 u2analytics_data_shows299-2307.csv
- 39378 total
wc -l u2data_all_shows.csv
- 39376 u2data_all_shows.csv
- Difference is the header lines -> correct
Importing the data into R for further curation
The data curation was an iterative process of checking the data, identifying issues, fixing the issues followed by another round of curation. The R code chunk below shows the multiple iterations that were done, followed by paragraphs explaining the issues that were observed and fixed in these iterations.
Fix problems in song_lyrics field
When reading the data file with read_csv() a problems attribute was created with 979 rows. There were many rows from the file that have an unexpected number of columns. I manually inspected a sample of these lines in the original data file.
Then I checked three lines that give a problem with read_csv, focusing on the song_lyrics field:
sed '2981!d' u2data_all_shows.csv
sed '1150!d' u2data_all_shows.csv
sed '115!d' u2data_all_shows.csv
I found different issues in the song_lyrics field:
- ““;; at the end of the line instead of a single”
- ” escaped quote characters in the song_lyrics field
I used the following awk commands on MacOS to fix these two issues:
awk '{gsub(/\\"/,"")}1' u2data_all_shows.csv > u2data_all_shows_fixquotes.csv
awk '{gsub(/";;/,"")}1' u2data_all_shows_fixquotes.csv > u2data_all_shows_fixquotes-semicolons.csv
Fix invalid date issue
In the file 18 lines were found where an invalid date was present with 00 for the day. I used a MacOS command line to fix this on the data file:
grep '\-00' u2data_all_shows_fixquotes-semicolons.csv | wc -l
18
awk '{gsub(/\-00/,"-01")}1' u2data_all_shows_fixquotes-semicolons.csv > u2data_all_shows_fixquotes-semicolons-dates.csv
grep '\-00' u2data_all_shows_fixquotes-semicolons-dates.csv | wc -l
0
Fix additional issues in the song_lyrics field
File was read again with read_csv() and problems() showed that there were still 30 lines with a problem. Here the most likely culprits are again in the song_lyrics field.
- an extra ” at the end of the line
Fix the extra ” at the end of the line:
grep '""\r' u2data_all_shows_fixquotes-semicolons-dates.csv | wc -l
8
awk '{gsub(/""\r/,"\\\"\r")}1' u2data_all_shows_fixquotes-semicolons-dates.csv > u2data_all_shows_fixquotes-semicolons-dates-others.csv
grep '""\r' u2data_all_shows_fixquotes-semicolons-dates-others.csv | wc -l
0
Loading the file with read_csv() shows that now the file still has 23 problem lines. Inspecting some problem lines, I observed the following potential issues in the song_lyrics field:
- “;”
- ““; at the end of the line Fixed this in the following manner:
grep '";"' u2data_all_shows_fixquotes-semicolons-dates-others.csv | wc -l
22
grep '"";\r' u2data_all_shows_fixquotes-semicolons-dates-others.csv | wc -l
22
awk '{gsub(/";"/,"")}1' u2data_all_shows_fixquotes-semicolons-dates-others.csv > u2data_all_shows_fixquotes-semicolons-dates-others-2.csv
awk '{gsub(/"";\r/,"\\\"\r")}1' u2data_all_shows_fixquotes-semicolons-dates-others-2.csv > u2data_all_shows_fixquotes-semicolons-dates-others-3.csv
grep '";"' u2data_all_shows_fixquotes-semicolons-dates-others-3.csv | wc -l
0
grep '"";\r' u2data_all_shows_fixquotes-semicolons-dates-others-3.csv | wc -l
0
Loading the file with read_csv() shows that now the file still has 1 problem line. I observed that two lines were concatenated, meaning that the carriage return character was missing. Here is the part of the line that shows this:
sed '4156!d' u2data_all_shows_fixquotes-semicolons-dates-others-3.csv
w I will follow... "299,"U2 Unforgettable Fire Tour","T
Fixed this in the following manner:
grep '"299,' u2data_all_shows_fixquotes-semicolons-dates-others-3.csv | wc -l
1
awk '{gsub(/"299,/,"\"\n299,")}1' u2data_all_shows_fixquotes-semicolons-dates-others-3.csv > u2data_all_shows_fixquotes-semicolons-dates-others-4.csv
grep '"299,' u2data_all_shows_fixquotes-semicolons-dates-others-4.csv | wc -l
1
sed '4156!d' u2data_all_shows_fixquotes-semicolons-dates-others-4.csv
sed '4157!d' u2data_all_shows_fixquotes-semicolons-dates-others-4.csv
Data cleaning result
Reading the last fixed file with read_csv() gave 0 errors and the following info:
Rows: 39376 Columns: 15
── Column specification ──────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (10): tour, leg, venue, city, state, country, song_title, show_url, song_url, song_lyrics
dbl (2): showID, song_position
lgl (2): snippet, encore
date (1): date
I stored this fixed data file with a better file name:
cp u2data_all_shows_fixquotes-semicolons-dates-others-4.csv u2data_all_shows_clean.csv