Scraping Tumblr

How I migrated my blog content to Hugo.

Today I completed a longstanding project to move my blog entirely over to Hugo. Previously, I maintained a blog on my consulting site, crome-plated, that was powered by Tumblr. I used Tumblr to create content easily, then pulled that content into my own site, transforming it to look like native content. The content often grew stale, though, and I felt it put a bad foot forward to prospective clients. I didn’t want to abandon the content, so I decided to resuscitate my personal blog.

Hugo is powered by Markdown; Tumblr is formatted with HTML. I didn’t have many entries, so I naively thought I would copy/paste/edit the old entries into Markdown. By the second post I was done with that, and decided to automate the process with Perl. In the end, I may have spent more time automating than I would have copying and pasting, but it was far more fun!

The Tumblr API makes it pretty easy to pull your posts; transforming them takes some work. HTML::FormatText handles the bulk of the conversion, turning a page full of HTML into plain text, but it unfortunately strips out the link URLs. With a little regex magic, however, HTML links can be translated into Markdown links first:

$body =~ s/<a href="(.+?)" target="_blank">(.+?)<\/a>/\[$2\]\($1\)/g;

After that substitution, HTML::FormatText no longer recognizes the links as anchor tags and passes them through as plain text.
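
For instance, here is the substitution run against a made-up anchor tag (the input string is invented for illustration, not from an actual post):

my $html = 'See <a href="https://example.com/post" target="_blank">this post</a> for details.';
(my $md = $html) =~ s/<a href="(.+?)" target="_blank">(.+?)<\/a>/\[$2\]\($1\)/g;
# $md is now: See [this post](https://example.com/post) for details.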

There is nothing else complicated about the process. My utility, tscrape, uses subroutine signatures, which in my opinion read better than code that uses shift and @_ for parameter passing.
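
To show what I mean, here are the two styles side by side (a toy example, not lifted from tscrape):

use feature qw( signatures );
no warnings 'experimental::signatures';

# Old style: unpack arguments from @_ by hand
sub get_title_old {
    my ( $post ) = @_;
    return $post->{ title };
}

# Signatures: the parameter list lives in the declaration itself
sub get_title_new( $post ) {
    return $post->{ title };
}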

The full code of tscrape:

#!/usr/bin/env perl

use lib 'lib';
use utf8;
use Modern::Perl;
use DateTime;
use WWW::Mechanize;
use Text::Unidecode;
use Cpanel::JSON::XS;
use HTML::FormatText;
use feature qw( signatures );
no warnings 'experimental::signatures';

say "Fetching posts...";
my $blog_url = "https://api.tumblr.com/v2/blog/your-blog.tumblr.com/posts?api_key=your-api-key";
my @posts    = get_posts();

foreach my $post ( @posts ) { 
    my $body     = get_body( $post );
    my $title    = get_title( $post );
    my $date     = get_date( $post );
    my $tags     = get_tags( $post );
    my $filename = get_filename( $title );

    # Create blog file
    print "Creating new post: $title ($date)... ";
    open my $blog_fh, '>', "scraped_content/$filename" or die "Can't open $filename: $!";
    print $blog_fh qq{---
title: $title
hero_image: "hero.jpg"
tags: $tags
date: $date
---

$body};

    close $blog_fh;
    say "done!";
}

my $dt = DateTime->now;
say "Completed at " . $dt->ymd('/') . " " . $dt->hms;
exit 0;

sub get_body( $post ) {
    # Only text posts carry a body; other post types (photo, quote,
    # link, etc.) come back empty.
    my $body = $post->{ type } eq 'text' ? $post->{ body } : '';

    # Rewrite anchor tags as Markdown links before HTML::FormatText
    # flattens the markup and throws the URLs away.
    $body =~ s/<a href="(.+?)" target="_blank">(.+?)<\/a>/\[$2\]\($1\)/g;
    $body  = HTML::FormatText->new( leftmargin => 0, rightmargin => 72 )->format_string( $body );
    return unidecode( $body );
}

sub get_title( $post ) {
    return unidecode( $post->{ title } );
}

sub get_date( $post ) {
    # Tumblr dates look like "2019-03-01 12:00:00 GMT"; Hugo wants
    # ISO 8601, e.g. "2019-03-01T12:00:00-00:00".
    my $date = unidecode( $post->{ date } );
    $date =~ s/ /T/;
    $date =~ s/ GMT$/-00:00/;
    return $date;
}

sub get_tags( $post ) {
    # Render the tag list as an inline array for the front matter,
    # e.g. ["perl", "hugo"].
    my @tags = $post->{ tags }->@*;
    return '[' . join( ', ', map { qq{"$_"} } @tags ) . ']';
}

# Make a filename that won't cause the filesystem to shit the bed
sub get_filename( $title ) {
    my $filename = $title;
    $filename =~ s/ /_/g;   # spaces become underscores (word chars)...
    $filename =~ s/\W//g;   # ...so they survive this non-word-char purge,
    $filename =~ s/_/-/g;   # then turn into dashes
    return lc "${filename}.md";
}

sub get_posts {
    my $mech = WWW::Mechanize->new;

    my $response  = $mech->get( $blog_url )->decoded_content;
    my $post_json = decode_json( $response );

    # The v2 API nests the post list: { response => { posts => [...] } }
    return $post_json->{ response }{ posts }->@*;
}
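
To make the output concrete: a hypothetical post titled "Scraping Tumblr", tagged perl and hugo, would land in scraped_content/scraping-tumblr.md with front matter along these lines (the date here is invented):

---
title: Scraping Tumblr
hero_image: "hero.jpg"
tags: ["perl", "hugo"]
date: 2019-03-01T12:00:00-00:00
---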