Wednesday, February 4, 2015

Integrate R with LiveJournal

Introduction


LiveJournal is a network service where Internet users can keep blogs or diaries, organize communities, and so on. It hosts a large number of user journals and blog entries from all over the world. In this article I describe how to access this data from R by using the LiveJournal APIs.


LiveJournal APIs

The LiveJournal API documentation is available here. In short, there are two main client interfaces: the Flat Protocol and the XML-RPC Protocol. In the Flat Protocol, the client interacts with the LJ server by using HTTP GET requests and provides input parameters as key / value pairs in the request string. In the XML-RPC Protocol, the client communicates with the LJ server by using HTTP POST requests and provides request parameters in the form of XML documents. In general, both protocols expose a quite similar set of functions:
  • checkfriends - Checks to see if your Friends list has been updated since a specified time.
  • consolecommand - Run an administrative command.
  • editevent - Edit or delete a user's past journal entry
  • editfriendgroups - Edit the user's defined groups of friends.
  • editfriends - Add, edit, or delete friends from the user's Friends list
  • friendof - Returns a list of which other LiveJournal users list this user as their friend. 
  • getchallenge - Generate a server challenge string for authentication.
  • getdaycounts - This mode retrieves the number of journal entries per day.
  • getevents - Download parts of the user's journal.
  • getfriends - Returns a list of which other LiveJournal users this user lists as their friend. 
  • getfriendgroups - Retrieves a list of the user's defined groups of friends. 
  • getusertags - Retrieves a list of the user's defined tags.
  • login - Validates the user's password and gets the base information needed for the client to function.
  • postevent - This is how a user actually submits a new log entry to the server. 
  • sessionexpire - Expires session cookies. 
  • sessiongenerate - Generate a session cookie. 
  • syncitems - Returns a list of all the items that have been created or updated for a user.
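
As a rough illustration of the XML-RPC request format, the sketch below builds a request body from a named list using base R only. The helper names xml_escape(), xmlrpc_member() and xmlrpc_call() are my own illustrative names, not part of httr, XML, or the LiveJournal API, and only string and integer parameters are handled.

```r
# Escape the characters that are special inside XML text
xml_escape = function(x) {
  x = gsub("&", "&amp;", x, fixed = TRUE)
  x = gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Render one struct member: R integers become <int>, everything else <string>
xmlrpc_member = function(name, value) {
  type = if (is.integer(value)) "int" else "string"
  sprintf("<member><name>%s</name><value><%s>%s</%s></value></member>",
          xml_escape(name), type, xml_escape(as.character(value)), type)
}

# Wrap the members into a methodCall document with a single struct parameter
xmlrpc_call = function(method, params = list()) {
  members = vapply(seq_along(params), function(i) {
    xmlrpc_member(names(params)[i], params[[i]])
  }, character(1))
  sprintf(paste0('<?xml version="1.0"?><methodCall><methodName>%s</methodName>',
                 '<params><param><value><struct>%s</struct></value></param></params>',
                 '</methodCall>'),
          method, paste(members, collapse = ""))
}

# A call without parameters and a call with one string and one integer parameter
challenge_body = xmlrpc_call("LJ.XMLRPC.getchallenge")
events_body = xmlrpc_call("LJ.XMLRPC.getevents", list(username = "test", ver = 1L))
```

The hand-written XML templates in the Code section do the same job. A helper like this mainly helps when user-supplied values may contain characters such as "&" or "<".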
Let's review some of these functions in more detail. In the rest of the article I concentrate on the following three tasks, which together let you browse the content of any user journal:
  • Authentication.
  • List journal entries for a specific user.
  • Get content of a journal entry.
Authentication. The LiveJournal API provides several authentication methods: password hash, challenge / response, and session cookies. I describe the challenge / response method in detail here; see the documentation for the other methods. To authenticate with LJ by using the challenge / response method you need to:

  1. Send a getchallenge request. LJ replies with an XML document that contains a challenge string.
  2. Form the response string as MD5(challenge + MD5(user_password)), where MD5() is a function that calculates the hexadecimal MD5 hash of the given string and "+" means string concatenation.

You can calculate an MD5 hash in R by using the digest() function from the digest package:
digest(paste(challenge, password_hash, sep = ""), "md5", serialize = FALSE)
The resulting response string should then be placed in every XML-RPC request that requires user authentication.

Note that a challenge usually expires quickly, so you cannot reuse the same response indefinitely and should regenerate it periodically.
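
To show what step 1 returns, here is a small sketch that parses a getchallenge reply with the XML package. The reply text is a made-up sample (the challenge string and both timestamps are invented), and get_member() is an illustrative helper name. The sketch assumes the reply carries expire_time and server_time members next to challenge, as I understand the protocol documentation to describe.

```r
library(XML)

# Made-up sample of a getchallenge reply; all values are invented
sample_reply = '<?xml version="1.0"?>
<methodResponse><params><param><value><struct>
<member><name>auth_scheme</name><value><string>c0</string></value></member>
<member><name>challenge</name><value><string>c0:1423000000:100:60:sample:0123456789abcdef</string></value></member>
<member><name>expire_time</name><value><int>1423000060</int></value></member>
<member><name>server_time</name><value><int>1423000000</int></value></member>
</struct></value></param></params></methodResponse>'

doc = xmlParse(sample_reply, asText = TRUE)

# Return the text of the first value stored under the given member name
get_member = function(doc, name) {
  xpathSApply(doc, sprintf('//member[name="%s"]/value/*', name), xmlValue)[[1]]
}

challenge = get_member(doc, "challenge")
expire_time = as.numeric(get_member(doc, "expire_time"))
server_time = as.numeric(get_member(doc, "server_time"))

# How long the challenge stays usable, measured on the server clock
seconds_valid = expire_time - server_time
```

Comparing expire_time with server_time rather than with the local clock avoids problems when the client clock is off.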

Please refer to the function lj_auth() below to see how to perform the steps above in R. It takes the user name and a pre-computed MD5 hash of the user password as input parameters, so you don't need to put your clear-text password in the code.
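
One possible way to pre-compute the password hash is sketched below. The environment variable name LJ_PASSWORD and the helper md5_hex() are assumptions made for this illustration, and the fallback value "password" is only a placeholder so the snippet runs as is. The challenge string is an invented sample.

```r
library(digest)

# Hexadecimal MD5 of a string (serialize = FALSE hashes the raw characters)
md5_hex = function(x) digest(x, algo = "md5", serialize = FALSE)

# Read the password from a (hypothetical) environment variable, not from code
password = Sys.getenv("LJ_PASSWORD", unset = "password")
password_hash = md5_hex(password)

# Combine the hash with a challenge exactly as in step 2 above
challenge = "c0:1423000000:100:60:sample:0123456789abcdef"  # invented sample
response = md5_hex(paste0(challenge, password_hash))
```

The hash only needs to be computed once. After that, password_hash can be stored and passed to the functions below in place of the password itself.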


List journal entries for a specific user. I use the getevents function from the LJ XML-RPC API. It returns information about a limited number of journal entries (100 items at most) starting from a given sync date, so it has to be called in a loop, starting from the earliest date and moving toward the latest one. The results are returned as a data.table. Please refer to the function lj_list() below. It takes the following input parameters: target journal name, user name, and user password hash.
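
The paging idea can be sketched without any network access. In the example below, fake_fetch() is a hypothetical stand-in for one getevents call (it returns at most two invented timestamps per call), and read_all() is an illustrative name for the loop that keeps moving the sync date forward until nothing more comes back.

```r
# Hypothetical stand-in for one getevents call: returns up to page_size
# event times that are not earlier than last_sync
fake_fetch = function(last_sync, page_size = 2) {
  all_times = c("2015-01-01 10:00:00", "2015-01-02 11:30:00",
                "2015-01-03 23:59:59", "2015-01-05 08:15:00",
                "2015-01-06 09:00:00")
  if (nzchar(last_sync)) {
    keep = as.POSIXct(all_times, tz = "UTC") >= as.POSIXct(last_sync, tz = "UTC")
    all_times = all_times[keep]
  }
  head(all_times, page_size)
}

# Call fetch() repeatedly, moving last_sync one second past the newest entry
read_all = function(fetch) {
  collected = character(0)
  last_sync = ""
  repeat {
    page = fetch(last_sync)
    if (length(page) == 0) break
    collected = c(collected, page)
    newest = max(as.POSIXct(page, tz = "UTC"))
    # An explicit format keeps the time part even when the result is midnight
    last_sync = format(newest + 1, "%Y-%m-%d %H:%M:%S")
  }
  collected
}

entries = read_all(fake_fetch)
```

Adding one second to the newest timestamp prevents the last entry of a page from being returned again on the next call.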

Get content of a journal entry. I use a quite simple approach here: just retrieve the journal entry content through the LJ RSS interface with a GET request. Please refer to the function lj_entry() below. This method doesn't require authorization for public posts.
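
The extraction step can be tried offline on a small RSS document. The feed below is an invented sample, not real LiveJournal output. It only shows how the //item/description XPath expression picks out the entry text.

```r
library(XML)

# Invented sample feed with a single item
sample_rss = '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>example journal</title>
    <item>
      <title>First post</title>
      <description>Hello from a sample entry</description>
    </item>
  </channel>
</rss>'

doc = xmlParse(sample_rss, asText = TRUE)

# Collect the description of every item; an empty feed yields no matches
descriptions = xpathSApply(doc, "//item/description", xmlValue)
entry_text = if (length(descriptions) == 0) "" else descriptions[[1]]
```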

Code


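# Request a fresh challenge from LJ and compute the matching response.
# user_name is not used for getchallenge itself; it is kept so that the
# signature is consistent with the other functions.
lj_auth = function(user_name, password_hash) {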
 
  require(httr)
  require(XML)
  require(digest)
 
  getchallenge_xml_rpc = '<?xml version="1.0"?>
    <methodCall>
      <methodName>LJ.XMLRPC.getchallenge</methodName>
      <params>
        <param>
          <value><struct></struct></value>
        </param>
      </params>
    </methodCall>
  '
 
  challenge_response = POST(
    "http://www.livejournal.com/interface/xmlrpc",
    accept_xml(),
    add_headers("Content-Type" = "text/xml"),
    body = getchallenge_xml_rpc
  )
 
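  # Parse the reply explicitly with the XML package, so that xpathApply() works
  # regardless of the object type content() returns by default
  challenge_xml = xmlParse(content(challenge_response, as = "text", encoding = "UTF-8"), asText = TRUE)
  challenge = xpathApply(challenge_xml, '//member[name="challenge"]/value/string', xmlValue)[[1]]
 
  # response = MD5(challenge + MD5(password))
  response = digest(paste(challenge, password_hash, sep = ""), "md5", serialize = FALSE)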
 
  return(list(challenge = challenge, response = response))
}
 
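# Read the list of entries of the journal use_journal page by page
# and return them as a data.table.
lj_list = function(use_journal, user_name, password_hash, show_progress = TRUE) {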
 
  require(httr)
  require(XML)
  require(data.table)
 
  entries = data.table(itemid = integer(0), subject = character(0), url = character(0), reply_count = integer(0), eventtime = character(0))
 
  last_sync = ""
 
  if (show_progress) cat("start reading...\n")
 
  repeat {
 
    auth = lj_auth(user_name, password_hash)
 
    xml_rpc = sprintf('<?xml version="1.0"?>
      <methodCall>
        <methodName>LJ.XMLRPC.getevents</methodName>
        <params>
          <param>
            <value>
              <struct>
                <member><name>username</name><value><string>%s</string></value></member>
                <member><name>auth_method</name><value><string>challenge</string></value></member>
                <member><name>auth_challenge</name><value><string>%s</string></value></member>
                <member><name>auth_response</name><value><string>%s</string></value></member>
                <member><name>ver</name><value><int>1</int></value></member>
                <member><name>truncate</name><value><int>50</int></value></member>
                <member><name>selecttype</name><value><string>syncitems</string></value></member>
                <member><name>howmany</name><value><int>200</int></value></member>
                <member><name>noprops</name><value><boolean>1</boolean></value></member>
                <member><name>lineendings</name><value><string>unix</string></value></member>
                <member><name>usejournal</name><value><string>%s</string></value></member>
                <member><name>lastsync</name><value><string>%s</string></value></member>
              </struct>
            </value>
          </param>
        </params>
      </methodCall>', user_name, auth$challenge, auth$response, use_journal, last_sync
    )
 
    response = POST(
      "http://www.livejournal.com/interface/xmlrpc",
      accept_xml(),
      add_headers("Content-Type" = "text/xml"),
      body = xml_rpc
    )
 
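    # Parse the reply explicitly with the XML package, so that xpathApply() works
    # regardless of the object type content() returns by default
    xdoc = xmlParse(content(response, as = "text", encoding = "UTF-8"), asText = TRUE)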
 
    itemid = as.integer(xpathApply(xdoc, '//member[name="ditemid"]/value/int', xmlValue))
    subject = as.character(xpathApply(xdoc, '//member[name="subject"]/value', xmlValue))
    reply_count = as.integer(xpathApply(xdoc, '//member[name="reply_count"]/value/int', xmlValue))
    url = as.character(xpathApply(xdoc, '//member[name="url"]/value/string', xmlValue))
    eventtime = as.character(xpathApply(xdoc, '//member[name="eventtime"]/value/string', xmlValue))
 
    if (length(url) == 0) break
 
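    # Continue one second after the newest entry read so far; the explicit format
    # keeps the time part even at midnight, and a fixed time zone avoids DST shifts
    last_sync = format(as.POSIXct(max(eventtime), tz = "UTC") + 1, "%Y-%m-%d %H:%M:%S")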
    if (show_progress) cat("read entries", length(url), "last sync time", last_sync, "\n")
 
    entries = rbind(entries, data.table(itemid = itemid, subject = subject, url = url, reply_count = reply_count, eventtime = eventtime))
  }
 
  return(entries)
}
 
lj_entry = function(use_journal, itemid) {
 
  require(httr)
  require(XML)
 
  url = sprintf("http://%s.livejournal.com/data/rss/?itemid=%d", use_journal, itemid)
  response = GET(url)
 
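  # Parse the RSS feed explicitly with the XML package before applying XPath
  rss = xmlParse(content(response, as = "text", encoding = "UTF-8"), asText = TRUE)
  text = xpathApply(rss, "//item/description", xmlValue)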
  if (length(text) == 0) return("")
 
  return(text[[1]])
}

Example



e = lj_list("<put journal name to read here>", "<put your lj user name here>", "<put your lj password md5 hash here>")
 
lj_entry("<put journal name to read here>", e[1]$itemid)
