How do I add another lyrics site?
Hi, I want to add the Hebrew lyrics site Shiron.net to the script. Can someone tell me how to do it?
All the information needed to extract lyrics from a site is stored in a JavaScript object. You can see the whole collection of all the sites here:
{id:0, //Identifier of the site, it's an incremental integer
name:"lyricwiki.org", //the name of the site, displayed for example in the dropdown
enabled:true, //if this site is enabled
searchEncoding:"", //specify an encoding for the xmlhttprequest for the search page, for example utf-8
searchUrl:"http://www.google.com/search?num=20&q=site%3Alyricwiki.org+{0}", //the url of the search page, the {0} will be replaced with the terms to search
parseListRegexp:"<a.*?href=\"(http://lyricwiki\\.org/[^/?]*?:[^/]*?)\".*?>(.*?)(?: - lyrics from.*?)?</a>", //the regular expression to extract the links and their text from the search page
lyricsUrl:"", //optional prefix that is prepended to the extracted links, if needed
lyricsEncoding:"", //Specify an encoding for the xmlhttprequest for the lyrics page
parseLyrics:[{ //an array of regular expression pairs
parseTitleRegexp:"<title>(.*?) - Lyrics from LyricWiki</title>", //the regular expression to extract the title
parseLyricsRegexp:"<div class='lyricbox' >((?:.|\\s)*?)<p>" //the regular expression to extract the lyrics
}]}
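As a rough illustration of how the `{0}` placeholder in searchUrl might be filled in, here is a minimal sketch. The function name `buildSearchUrl` is hypothetical and not taken from the script, and the script's own substitution and encoding may differ.

```javascript
// Hypothetical helper: not part of the script, it only illustrates the {0} placeholder.
function buildSearchUrl(site, terms) {
  // Encode the search terms, then use "+" for spaces as in the searchUrl above.
  const encoded = encodeURIComponent(terms).replace(/%20/g, "+");
  // A replacer function keeps "$" in the terms from being treated specially.
  return site.searchUrl.replace("{0}", () => encoded);
}

// Shortened copy of the lyricwiki.org definition shown above.
const lyricwikiSite = {
  id: 0,
  name: "lyricwiki.org",
  searchUrl: "http://www.google.com/search?num=20&q=site%3Alyricwiki.org+{0}"
};

console.log(buildSearchUrl(lyricwikiSite, "hello world"));
```

The same idea applies to any definition whose searchUrl contains `{0}`.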
I still don't get it. I have never written anything in Java, so can anybody give me an explanation for noobies? With animelyrics.com as an example, please ^^. Edit: Thank you a lot for this useful tutorial. Keep up the good work.
Animelyrics is not really a good site to extract lyrics from because of its HTML layout and other details, but you can try if you want. Open the Youtube Lyrics options panel, scroll down to the lyrics site definitions, and choose to edit them. Scroll to the end of the text box and add a new "lyrics site definition". Change this:
{id:12,
name:"lyricsbay.com",
...
parseLyricsRegexp:"<div id=EchoTopic>\\s+((?:.|\\s)*?)<\\/div>"}]}]
to this (pay attention to the final square bracket: you must insert the new definition before it, exactly between the last curly bracket and the last square bracket, and add a comma after the curly bracket):
{id:12,
name:"lyricsbay.com",
...
parseLyricsRegexp:"<div id=EchoTopic>\\s+((?:.|\\s)*?)<\\/div>"}]},
{id:99,
name:"animelyrics.com",
enabled:true,
searchEncoding:"",
searchUrl:"",
parseListRegexp:"",
lyricsUrl:"",
lyricsEncoding:"",
parseLyrics:[]}]
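If you want to check the comma and bracket placement outside the options panel, the following sketch evaluates the definitions text and reports a syntax error. The function `checkDefinitions` and the shortened sample strings are made up for this illustration; how the script itself validates the text is not shown in this thread. Only run it on text you trust, since it executes the text as JavaScript.

```javascript
// Hypothetical check: evaluate the definitions text the way a JavaScript engine would.
// The text is an array literal with unquoted keys, so JSON.parse would reject it.
function checkDefinitions(text) {
  try {
    const sites = new Function("return " + text + ";")();
    if (!Array.isArray(sites)) {
      return { ok: false, error: "definitions must be an array" };
    }
    return { ok: true, count: sites.length };
  } catch (e) {
    return { ok: false, error: e.message };
  }
}

// Shortened stand-ins for the real definitions.
const goodDefinitions =
  '[{id:12, name:"lyricsbay.com", parseLyrics:[]},\n' +
  '{id:99, name:"animelyrics.com", parseLyrics:[]}]';
// Same text, but the comma between the two definitions is missing.
const badDefinitions =
  '[{id:12, name:"lyricsbay.com", parseLyrics:[]}\n' +
  '{id:99, name:"animelyrics.com", parseLyrics:[]}]';

console.log(checkDefinitions(goodDefinitions));
console.log(checkDefinitions(badDefinitions));
```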
Press the Apply or OK button. If you have added the definition correctly, you will receive no errors; otherwise an alert popup will show the JavaScript error message. Animelyrics is now a valid lyrics site, but you still have to activate it before it shows up in the lyrics site drop-down box. Look in the configuration panel under "Lyrics Sites": the new site animelyrics appears at the end of the list. Check the checkbox beside it, and animelyrics will also appear in the lyrics sites drop-down.

The first problem with animelyrics is that its search engine is unfriendly: it doesn't return all the results on a single page, so you have to follow its links. So don't use their search engine; use Google to search their pages instead. Luckily, a lot of lyrics site definitions already use Google as their search engine, so it's a matter of copy and paste:
{id:99,
name:"animelyrics.com",
enabled:true,
searchEncoding:"",
searchUrl:"http://www.google.com/search?num=20&q=site%3Aanimelyrics.com+{0}",
parseListRegexp:"",
lyricsUrl:"",
lyricsEncoding:"",
parseLyrics:[]}]
Then you have to provide the regular expression that extracts the links from the Google search page. If you don't know what regular expressions are, don't feel dumb; I estimate that over 90% of programmers don't know them. In a few words, regular expressions are search patterns for retrieving information from text. You make a request to Google and receive back an HTML page, which you have to parse with the help of regular expressions.

Go to the Google web search page and type in the search field: ".hack site:animelyrics.com". You will see that some of the results aren't lyrics pages. All the lyrics pages end with ".htm", so you can refine your Google search with: ".hack site:animelyrics.com link:.htm". Now you receive only lyrics pages.

From the whole page you need only the links to animelyrics. Here is the regexp (regular expression) to extract them:
"<a[^>]*?href=\"(http://www\\.animelyrics\\.com/[^\"]*?)\"[^>]*?>(.*?)</a>"
The regexp has two captures (the parts in round brackets): the first one is the link and the second one is the text inside the link. Animelyrics titles all its pages with "Anime Lyrics dot Com - bla bla bla". The first part is completely useless, and you can remove it as follows:
"<a[^>]*?href=\"(http://www\\.animelyrics\\.com/[^\"]*?)\"[^>]*?>[^-]*?-\\s*(.*?)</a>"
This skips all the characters up to and including the first dash. So your lyrics site definition will become:
{id:99,
name:"animelyrics.com",
enabled:true,
searchEncoding:"",
searchUrl:"http://www.google.com/search?num=20&q=site%3Aanimelyrics.com+link%3A.htm+{0}",
parseListRegexp:"<a[^>]*?href=\"(http://www\\.animelyrics\\.com/[^\"]*?)\"[^>]*?>[^-]*?-\\s*(.*?)</a>",
lyricsUrl:"",
lyricsEncoding:"",
parseLyrics:[]}]
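To see the two captures at work, here is a small sketch that runs the parseListRegexp above against a made-up fragment of a result page. The sample HTML, its URLs, and the helper `extractLinks` are hypothetical, and applying the pattern with the global flag is an assumption about how the script uses it.

```javascript
// The parseListRegexp string from the definition above.
const listPattern = "<a[^>]*?href=\"(http://www\\.animelyrics\\.com/[^\"]*?)\"[^>]*?>[^-]*?-\\s*(.*?)</a>";

// Made-up fragment of a search result page; the URLs are placeholders.
const sampleResults =
  '<li><a href="http://www.animelyrics.com/anime/example/song1.htm" class="l">' +
  'Anime Lyrics dot Com - Song One - Example</a></li>' +
  '<li><a href="http://www.animelyrics.com/anime/example/song2.htm">' +
  'Anime Lyrics dot Com - Song Two - Example</a></li>';

// Assumption: the pattern is applied globally to collect every link on the page.
function extractLinks(html, pattern) {
  const results = [];
  for (const m of html.matchAll(new RegExp(pattern, "g"))) {
    // m[1] is the first capture (the link), m[2] the second (the cleaned link text).
    results.push({ url: m[1], text: m[2] });
  }
  return results;
}

console.log(extractLinks(sampleResults, listPattern));
```

With this sample, the "Anime Lyrics dot Com - " prefix is dropped from each link text because the second capture only starts after the first dash.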
You don't need a searchEncoding because Google sends the encoding correctly, and you don't need a lyricsUrl because the extracted links already contain the domain. Now you need the regexps to parse the lyrics page. The lyrics are sometimes placed inside an HTML pre tag, but when there is a translation they are placed inside an HTML table, which makes the whole thing more complicated. The first case is simple. To parse a lyrics page you need two regexps: one for the title and one for the lyrics. Normally you can get the title from the HTML title tag, and as already said, the lyrics are inside an HTML pre tag with the CSS class lyrics.
{
parseTitleRegexp:"<title>[^-]*?- ([\\s\\S]*?)</title>",
parseLyricsRegexp:"<pre class=lyrics>([\\s\\S]*?)</pr[e]>"
}
Now add this to the lyrics site definition and you should have the following result:
{id:99,
name:"animelyrics.com",
enabled:true,
searchEncoding:"",
searchUrl:"http://www.google.com/search?num=20&q=site%3Aanimelyrics.com+link%3A.htm+{0}",
parseListRegexp:"<a[^>]*?href=\"(http://www\\.animelyrics\\.com/[^\"]*?)\"[^>]*?>[^-]*?-\\s*(.*?)</a>",
lyricsUrl:"",
lyricsEncoding:"",
parseLyrics:[{parseTitleRegexp:"<title>[^-]*?- ([\\s\\S]*?)</title>",
parseLyricsRegexp:"<pre class=lyrics>([\\s\\S]*?)</pr[e]>"}]}]
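A quick way to try the title and lyrics regexps is to run them against a small page that mimics the layout described above. The sample page and the helper `parsePage` are invented for this sketch; the script's real parsing code may work differently.

```javascript
// The regexp pair from the definition above.
const prePair = {
  parseTitleRegexp: "<title>[^-]*?- ([\\s\\S]*?)</title>",
  parseLyricsRegexp: "<pre class=lyrics>([\\s\\S]*?)</pr[e]>"
};

// Made-up page: title in the title tag, lyrics in a pre tag with class lyrics.
const samplePage =
  "<html><head><title>Anime Lyrics dot Com - Song One - Example</title></head>" +
  "<body><pre class=lyrics>first line\nsecond line</pre></body></html>";

// Hypothetical helper: returns the first capture of each regexp, or null if it does not match.
function parsePage(html, pair) {
  const title = html.match(new RegExp(pair.parseTitleRegexp));
  const lyrics = html.match(new RegExp(pair.parseLyricsRegexp));
  return {
    title: title ? title[1] : null,
    lyrics: lyrics ? lyrics[1] : null
  };
}

console.log(parsePage(samplePage, prePair));
```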
But sometimes they use a table with two columns to show the translation, which is really hard to parse, at least if you want to extract only the text without any HTML formatting tags. If you don't mind keeping the HTML formatting tags, you can extract the whole table as follows:
{parseTitleRegexp:"<title>[^-]*?- ([\\s\\S]*?)</title>",
parseLyricsRegexp:"<table border=0 cellspacing=0>[\\s\\S]*?</table>"}
In the end you will have the following lyrics site definition:
{id:99,
name:"animelyrics.com",
enabled:true,
searchEncoding:"",
searchUrl:"http://www.google.com/search?num=20&q=site%3Aanimelyrics.com+link%3A.htm+{0}",
parseListRegexp:"<a[^>]*?href=\"(http://www\\.animelyrics\\.com/[^\"]*?)\"[^>]*?>[^-]*?-\\s*(.*?)</a>",
lyricsUrl:"",
lyricsEncoding:"",
parseLyrics:[{parseTitleRegexp:"<title>[^-]*?- ([\\s\\S]*?)</title>",
parseLyricsRegexp:"<table border=0 cellspacing=0>[\\s\\S]*?</table>"},
{parseTitleRegexp:"<title>[^-]*?- ([\\s\\S]*?)</title>",
parseLyricsRegexp:"<pre class=lyrics>([\\s\\S]*?)</pr[e]>"}]}]
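The sketch below shows one way an array of regexp pairs could be used: try each pair in order and keep the first one whose lyrics regexp matches. That ordering rule, the helper `parseWithFallback`, and the sample pages are assumptions made for illustration. The thread also does not say how the script treats a lyrics pattern without a capture group, so the sketch falls back to the whole match in that case.

```javascript
// The two pairs from the final definition, in the same order.
const animelyricsPairs = [
  { parseTitleRegexp: "<title>[^-]*?- ([\\s\\S]*?)</title>",
    parseLyricsRegexp: "<table border=0 cellspacing=0>[\\s\\S]*?</table>" },
  { parseTitleRegexp: "<title>[^-]*?- ([\\s\\S]*?)</title>",
    parseLyricsRegexp: "<pre class=lyrics>([\\s\\S]*?)</pr[e]>" }
];

// Assumption: pairs are tried in order and the first lyrics match wins.
function parseWithFallback(html, pairs) {
  for (let i = 0; i < pairs.length; i++) {
    const lyrics = html.match(new RegExp(pairs[i].parseLyricsRegexp));
    if (!lyrics) continue;
    const title = html.match(new RegExp(pairs[i].parseTitleRegexp));
    return {
      pairIndex: i,
      title: title ? title[1] : null,
      // The table pattern has no capture group, so use the whole match instead.
      lyrics: lyrics[1] !== undefined ? lyrics[1] : lyrics[0]
    };
  }
  return null; // no pair matched this page
}

// Made-up pages for the two layouts described above.
const plainPage =
  "<title>Anime Lyrics dot Com - Song One</title><pre class=lyrics>la la la</pre>";
const tablePage =
  "<title>Anime Lyrics dot Com - Song Two</title>" +
  "<table border=0 cellspacing=0><tr><td>romaji</td><td>english</td></tr></table>";

console.log(parseWithFallback(plainPage, animelyricsPairs));
console.log(parseWithFallback(tablePage, animelyricsPairs));
```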
Wow, this post is quite impressive. I was just about to ask for a lyrics.wikia update, but maybe I'll try it myself; it shouldn't be hard to parse (the lyrics encoding in the source is gross, but it shouldn't be a problem). Thanks!


