Hubble takes some of the most beautiful images in the universe, and the team is kind enough to post them on their site. I wanted to download these images for use as desktop backgrounds, or as art in future openFrameworks experiments. Ruby has a couple of libraries that make this easy: the script I wrote, HubbleScrapper, uses HTTParty and the standard Net::HTTP library to pull the images down. It doesn't take any arguments, but it does contain some interesting tidbits.
Following 301 redirects
The script follows moved (3xx) responses in two ways. HTTParty handles redirects internally, and that works fine for the search index. For the images themselves, though, HTTParty's handling pulled down the preview rather than the full image. I found an example method that uses the standard library and incorporated it into my fetch method. I'm not sure why HTTParty and the snippet below behave differently; answering that would mean digging into the internals.
require 'net/http'
require 'uri'

##
# fetch pulls uri_str using the standard Net::HTTP library, recursing up to
# `limit` times if the server responds with a redirect.
#
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  url = URI.parse(uri_str)
  req = Net::HTTP::Get.new(url.path, { 'User-Agent' => 'hubble-fetcher' })
  response = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }
  case response
  when Net::HTTPSuccess then response
  when Net::HTTPRedirection
    puts "Redirect Location: #{response['location']}"
    fetch(response['location'], limit - 1)
  else
    response.error!
  end
end
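One subtlety worth noting about the snippet above (not something the original script trips over, since the Hubble URLs involved have no query strings): `Net::HTTP::Get.new(url.path)` drops any query string, because `path` excludes it. `URI::HTTP#request_uri` keeps it, as a quick sketch shows:

```ruby
require 'uri'

url = URI.parse('https://example.com/images?id=42')
url.path         # path only, query string is lost
url.request_uri  # path plus query string, safe to hand to Net::HTTP::Get
```

If the search pages ever moved their parameters into a query string, swapping `url.path` for `url.request_uri` in fetch would be the fix.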
Threads
I used a simple consumer model for the threading. This isn't quite producer/consumer, since all of the URLs are collected before any threads start. Eight threads are created (and later joined), and each one pops image-page URLs off the shared list and fetches them independently.
workers = (0...8).map do
  Thread.new do
    # Array#pop returns nil once the list is empty, ending the loop
    while (url = @image_page_urls.pop)
      visit_image_page url
    end
  end
end
workers.each(&:join)
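One caveat with popping from a plain Array across threads: it works on MRI thanks to the global interpreter lock, but it isn't guaranteed to be thread-safe on other Rubies. Thread::Queue gives the same pattern a safe footing. The sketch below uses placeholder URLs and stands in `upcase` for the script's `visit_image_page`; the non-blocking `pop(true)` raises ThreadError on an empty queue, which the rescue turns into a nil that ends the loop:

```ruby
queue = Queue.new
%w[page_a page_b page_c page_d].each { |u| queue << u }  # placeholder URLs
results = Queue.new

workers = (0...8).map do
  Thread.new do
    # pop(true) is non-blocking: it raises ThreadError when the queue is
    # empty, which the rescue converts to nil, terminating the loop
    while (url = queue.pop(true) rescue nil)
      results << url.upcase  # stand-in for visit_image_page
    end
  end
end
workers.each(&:join)
```

Since all the work is queued up front, the non-blocking pop is enough; a long-running producer/consumer setup would use the blocking `pop` with a sentinel value per worker instead.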