Scraping Tumblr
Today I completed a longstanding project to move my blog entirely over to Hugo. Previously, I maintained a blog on my consulting site, crome-plated, which was powered by Tumblr. I used Tumblr to easily create content, and then I would pull that content into my own site, transforming it to look like native content. The content often grew stale, though, and I felt it put a bad foot forward to prospective clients. I didn't want to abandon the content, so I decided to resuscitate my personal blog.
Hugo is powered by Markdown. Tumblr is formatted with HTML. I didn’t have many entries, and so naively thought that I would copy/paste/edit the old entries into Markdown. By the second post, I was done with it, and decided to automate this process with Perl. By the end, I might have spent more time automating than I would have copying and pasting, but this way was way more fun!
The Tumblr API makes it pretty easy to pull your posts. Transforming them takes some work. HTML::FormatText makes the bulk of the process easy, converting a page full of HTML into text, but it unfortunately strips the link content out. With a little regex magic, however, HTML links could be translated into Markdown links:
$body =~ s/<a href="(.+?)" target="_blank">(.+?)<\/a>/\[$2\]\($1\)/g;
After that, HTML::FormatText will not recognize the links as anchor tags and will treat them as plain text.
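The substitution can be traced on a hypothetical snippet of Tumblr-flavored HTML (the URL and text here are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $body = 'Read <a href="https://example.com/" target="_blank">this post</a> first.';

# Rewrite target="_blank" anchors as Markdown links
$body =~ s/<a href="(.+?)" target="_blank">(.+?)<\/a>/\[$2\]\($1\)/g;

print $body, "\n";   # Read [this post](https://example.com/) first.
```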
There is nothing else complicated about the process. My utility, tscrape, uses subroutine signatures, and in my opinion, reads better than code that uses shift and @_ for parameter passing.
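The difference is easy to see side by side. This is a small sketch (the greet routines are hypothetical, not part of tscrape):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw( say signatures );
no warnings 'experimental::signatures';

# Classic parameter passing via @_:
sub greet_old {
    my ( $name, $greeting ) = @_;
    return "$greeting, $name!";
}

# The same routine with a subroutine signature, including a default:
sub greet_new( $name, $greeting = 'Hello' ) {
    return "$greeting, $name!";
}

say greet_old( 'world', 'Hello' );   # Hello, world!
say greet_new( 'world' );            # Hello, world!
```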
The full code of tscrape:
#!/usr/bin/env perl
use lib 'lib';
use utf8;
use Modern::Perl;
use DateTime;
use WWW::Mechanize;
use Text::Unidecode;
use Cpanel::JSON::XS;
use HTML::FormatText;
use feature qw( signatures );
no warnings 'experimental::signatures';
say "Fetching posts...";
my $blog_url = "http://api.tumblr.com/v2/blog/your-blog.tumblr.com/posts?api_key=your-api-key";
my @posts = get_posts();
foreach my $post ( @posts ) {
    my $body     = get_body( $post );
    my $title    = get_title( $post );
    my $date     = get_date( $post );
    my $tags     = get_tags( $post );
    my $filename = get_filename( $title );

    # Create blog file
    print "Creating new post: $title ($date)... ";
    open my $blog_fh, '>', "scraped_content/$filename" or die "Can't open $filename: $!";
    print $blog_fh qq{---
title: $title
hero_image: "hero.jpg"
tags: $tags
date: $date
---
$body};
    close $blog_fh;
    say "done!";
}
my $dt = DateTime->now;
say "Completed at " . $dt->ymd('/') . " " . $dt->hms;
exit 0;
sub get_body( $post ) {
    my $type = $post->{ type };
    my $body;
    if( $type eq 'text' ) {
        $body = $post->{ body };
    }
    else {
        $body = '';
    }
    $body =~ s/<a href="(.+?)" target="_blank">(.+?)<\/a>/\[$2\]\($1\)/g;
    $body = HTML::FormatText->new( leftmargin => 0, rightmargin => 72 )->format_string( $body );
    return unidecode( $body );
}
sub get_title( $post ) {
    return unidecode( $post->{ title } );
}
# Tumblr dates look like "2019-06-01 14:30:00 GMT"; Hugo wants ISO 8601
sub get_date( $post ) {
    my $date = unidecode( $post->{ date } );
    $date =~ s/ /T/;
    $date =~ s/ GMT$/\-00:00/;
    return $date;
}
sub get_tags( $post ) {
    my @tags    = $post->{ tags }->@*;
    my $taglist = '[' . join( ', ', map { '"' . $_ . '"' } @tags ) . ']';
    return $taglist;
}
# Make a filename that won't cause the filesystem to shit the bed
sub get_filename( $title ) {
    my $filename = $title;
    $filename =~ s/ /_/g;   # protect spaces from the \W strip below
    $filename =~ s/\W//g;
    $filename =~ s/_/-/g;
    return lc "$filename.md";
}
sub get_posts {
    my $mech      = WWW::Mechanize->new;
    my $response  = $mech->get( $blog_url )->decoded_content;
    my $post_json = decode_json( $response );
    return $post_json->{ response }{ posts }->@*;
}
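The date and filename munging above can be traced on a hypothetical post (the title and timestamp here are made up, but the substitutions are the same as in get_date() and get_filename()):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Date: Tumblr's "YYYY-MM-DD HH:MM:SS GMT" becomes ISO 8601 with a UTC offset
my $date = '2019-06-01 14:30:00 GMT';
$date =~ s/ /T/;            # first space becomes the "T" separator
$date =~ s/ GMT$/-00:00/;   # trailing zone becomes an offset
print "$date\n";            # 2019-06-01T14:30:00-00:00

# Filename: spaces survive the \W strip as underscores, then become hyphens
my $title = 'Scraping Tumblr, Part 2!';
( my $filename = $title ) =~ s/ /_/g;
$filename =~ s/\W//g;
$filename =~ s/_/-/g;
print lc("$filename.md"), "\n";   # scraping-tumblr-part-2.md
```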