A while ago I noticed that Twitter changed its design for users' images.
When previewing an image, we used to have the option to get a picture grid of all the images the user had posted. It seems that Twitter disabled that feature, so visitors can now view only one image at a time.
The missing grid view is a pain, and a lot of people want it back, as seen in this support thread at Twitter:
https://dev.twitter.com/discussions/9843
So I decided to create an application that would let me get a profile's images easily.
The obvious way to go about programming this is to use the Twitter API (after reading the documentation, of course :-) ).
I noticed two Twitter API functions that let us reach our goal using plain and simple HTTP GET requests.
1 -
http://api.twitter.com/1/statuses/user_timeline.xml?screen_name=CodeBTL
The user_timeline.xml command returns a simple XML file with the user's recent tweets.
The function supports additional parameters, such as count, which lets you specify the maximum number of tweets to receive, and trim_user, which removes the appended user data.
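As a rough sketch, the request URL with both optional parameters could be assembled like this. The class and method names (TimelineUrl, Build) are made up for illustration and are not part of the application.

```csharp
using System;

static class TimelineUrl
{
    // Builds the user_timeline.xml request URL shown above.
    // The arguments map to the screen_name, count and trim_user parameters.
    public static string Build(string screenName, int count, bool trimUser)
    {
        string url = "http://api.twitter.com/1/statuses/user_timeline.xml"
                   + "?screen_name=" + Uri.EscapeDataString(screenName)
                   + "&count=" + count;

        // Only append trim_user when it is requested.
        if (trimUser)
        {
            url += "&trim_user=1";
        }
        return url;
    }
}
```

Calling TimelineUrl.Build("CodeBTL", 200, true) yields the URL from above with count=200 and trim_user=1 appended.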
[Image: XML result from the Twitter user_timeline function]
Notice that for each tweet we get the tweet id and the tweet text.
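Reading the id and text of each tweet from that XML could look roughly like the sketch below. It assumes the root element holds status elements that each contain an id and a text child; the TimelineParser name and the sample XML in the usage note are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class TimelineParser
{
    // Returns (tweet id, tweet text) pairs, one per <status> element.
    // Assumes <status> elements sit directly under the root element.
    public static List<KeyValuePair<string, string>> ReadTweets(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Root.Elements("status")
                  .Select(s => new KeyValuePair<string, string>(
                      (string)s.Element("id"),      // null if the element is missing
                      (string)s.Element("text")))
                  .ToList();
    }
}
```

For a trimmed-down input such as <statuses><status><id>279580723336331264</id><text>hello</text></status></statuses>, ReadTweets returns a single pair holding that id and text.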
2 -
http://api.twitter.com/1/statuses/show.xml?id=279580723336331264&include_entities=1
The show.xml function receives a tweet id and returns the XML description of that tweet.
Like most Twitter functions, it supports additional parameters. The most important one for us is include_entities, which shows whether the tweet contains any media links, so we can take the link and display the image.
[Image: XML result from the Twitter show function]
When testing my application I found a big problem.
Twitter limits unauthenticated requests to 150 per hour, and each of our GET requests counts as one. Reference:
https://dev.twitter.com/docs/rate-limiting
That means we can check fewer than 150 tweets per hour for images, which really limits us.
I tried registering an application with Twitter and authenticating a user for the requests, but the limit seems to be final.
The only option I saw to bypass the limit was to skip the Twitter API and do some HTML scraping.
I created a simple form that lets the user enter the Twitter username whose images they want to retrieve.
Once a username is entered, I use the first URL in this post, figuring that for now 150 requests per hour is enough. The result is an XML file with up to 200 status nodes.
I iterate through all of the status texts looking for links.
Twitter rewrites all posted links (including links to uploaded media files) to its own t.co format.
To check for such a link we can use a regular expression:
Match m = Regex.Match(TweetText, @"(?<twitterURL>t\.co)/(?<subdir>[^\s]*)");
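Since a tweet can contain more than one link, a small helper that collects every t.co link might be handy. This is only a sketch: the TcoLinks class is a made-up name, and the pattern escapes the dot so that it matches a literal period only.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TcoLinks
{
    // Same named groups as the expression above; the dot in "t.co" is escaped.
    private static readonly Regex TcoPattern =
        new Regex(@"(?<twitterURL>t\.co)/(?<subdir>[^\s]+)");

    // Returns every t.co link found in the tweet text, rebuilt as a full URL.
    public static List<string> Extract(string tweetText)
    {
        var links = new List<string>();
        foreach (Match m in TcoPattern.Matches(tweetText))
        {
            links.Add("http://t.co/" + m.Groups["subdir"].Value);
        }
        return links;
    }
}
```

Note that the subdir group stops only at whitespace, so punctuation typed directly after a link would be captured as part of it.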
If we find a link, we can't just download it, since Twitter redirects to the location of the original link. To capture the actual target we send an HTTP HEAD request and read the redirect URL, like this:
var request = (HttpWebRequest)WebRequest.Create(new Uri(@"http://t.co/" + m.Groups["subdir"].Value));
request.Method = "HEAD";
request.AllowAutoRedirect = false;
string location;
using (var response = request.GetResponse() as HttpWebResponse)
{
    location = response.GetResponseHeader("Location");
}
The new location does not point to the image itself but to a web page that displays the image.
Currently I support two types of images: those hosted on Twitter (URL starts with twitter.com and contains /photo/) and those hosted on Instagram (URL starts with instagr.am).
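The check for the two supported hosts could be sketched as follows. The ImageHost enum and ImageHostDetector class are illustrative names, and comparing the parsed host (with an optional www. prefix removed) is my own reading of "starts with", not necessarily how the application does it.

```csharp
using System;

enum ImageHost { Unknown, Twitter, Instagram }

static class ImageHostDetector
{
    // Classifies the redirect target by host name and path.
    public static ImageHost Detect(string location)
    {
        Uri uri;
        if (!Uri.TryCreate(location, UriKind.Absolute, out uri))
        {
            return ImageHost.Unknown;
        }

        string host = uri.Host.ToLowerInvariant();
        if (host.StartsWith("www.", StringComparison.Ordinal))
        {
            host = host.Substring(4);
        }

        if (host == "twitter.com" && uri.AbsolutePath.Contains("/photo/"))
        {
            return ImageHost.Twitter;
        }
        if (host == "instagr.am")
        {
            return ImageHost.Instagram;
        }
        return ImageHost.Unknown;
    }
}
```

Anything that comes back as ImageHost.Unknown can simply be skipped.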
To finish our scraping session we need to find the correct image tag on the web page.
For Twitter, the img tag's class attribute has the value "large media-slideshow-image".
For Instagram, the img tag's class attribute has the value "photo".
Since all of the code up to this point was self-written, I didn't want to use the HTML Agility Pack or another third-party component. So, using regular expressions again, I wrote the GetImageTags function:
private List<ImgTag> GetImageTags(String html)
{
    List<ImgTag> imgTags = new List<ImgTag>();
    // Match each <img ...> tag on its own, stopping at the tag's closing bracket.
    MatchCollection m1 = Regex.Matches(html, @"(<img\b[^>]*>)", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    foreach (Match m in m1)
    {
        string value = m.Groups[1].Value;
        ImgTag imgTag = new ImgTag();
        // Pull the src attribute out of the tag, if present.
        Match m2 = Regex.Match(value, @"src=\""(.*?)\""", RegexOptions.Singleline);
        if (m2.Success)
        {
            imgTag.src = m2.Groups[1].Value;
        }
        // Pull the class attribute out of the tag, if present.
        m2 = Regex.Match(value, @"class=\""(.*?)\""", RegexOptions.Singleline);
        if (m2.Success)
        {
            imgTag.classAtt = m2.Groups[1].Value;
        }
        imgTags.Add(imgTag);
    }
    return imgTags;
}
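Picking the right tag out of that list then comes down to comparing class attributes. The sketch below assumes ImgTag is a plain class with src and classAtt string members, which is how GetImageTags uses it; the ImagePicker class and FindImageSrc method are made-up names.

```csharp
using System.Collections.Generic;

// Assumed shape of ImgTag, inferred from how GetImageTags fills it.
class ImgTag
{
    public string src;
    public string classAtt;
}

static class ImagePicker
{
    // Returns the src of the first tag whose class attribute equals
    // expectedClass, or null when no such tag has a usable src.
    public static string FindImageSrc(IEnumerable<ImgTag> tags, string expectedClass)
    {
        foreach (ImgTag tag in tags)
        {
            if (tag.classAtt == expectedClass && !string.IsNullOrEmpty(tag.src))
            {
                return tag.src;
            }
        }
        return null;
    }
}
```

For a Twitter page the expected class would be "large media-slideshow-image", and for an Instagram page it would be "photo".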
Once we have the src attribute value, all that is left to do is download the image.
You can get the Twitter image downloader application at my CodePlex project page:
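A minimal download step could look like the sketch below. It is not taken from the application: the ImageDownloader class, the FileNameFor helper and the "image" fallback name are assumptions of mine, and WebClient is used only because it fits the HttpWebRequest-era code in this post (newer .NET versions mark it as obsolete).

```csharp
using System;
using System.IO;
using System.Net;

static class ImageDownloader
{
    // Derives a local file name from the last segment of the URL path.
    public static string FileNameFor(string src)
    {
        string name = Path.GetFileName(new Uri(src).AbsolutePath);
        return string.IsNullOrEmpty(name) ? "image" : name;
    }

    // Downloads the image into targetFolder and returns the local path.
    public static string Download(string src, string targetFolder)
    {
        string path = Path.Combine(targetFolder, FileNameFor(src));
        using (var client = new WebClient())
        {
            client.DownloadFile(src, path);
        }
        return path;
    }
}
```

Two images that share the same last path segment would overwrite each other here, so a real tool would probably add the tweet id to the file name.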
https://twitterimagedownload.codeplex.com/
Take a look at the source code. Recommendations and remarks welcome.
Update 29/6/2013 : I've updated the project to support the Twitter API Ver. 1.1.
That should fix the crash issue that occurred when fetching the images.