I'm not an expert in PHP (I'm really just beginning to grasp it) and I need a little help customizing a script that extracts titles from Tumblr posts (and I insist, it's a small problem; I'm not asking for the whole script to be written for me).

For those who don't know: Tumblr is a microblogging platform whose distinguishing feature is a set of pre-formatted post types (quote, photo, etc.). None of those types, except for one, includes a title.

I'll do my best and try to make this question as specific as possible so I can get an answer as specific as possible. I'll make it brief.

I'm starting with Ben Ward's script Tumblr2Wordpress (available on GitHub). Here's the part that acts as a title extractor:

Now, here's how I understand it to work (for those who do not read PHP: I don't either, but I can guess):

Create a function called formatEntryTitle that we'll later use to format the output of the data fetched by reading the API of a Tumblr blog.

Then there's code that counts the number of lines in order to limit the search. The idea here, Ben's idea, is that this title extractor will search the post content for a specific item. If the search fails after a certain number of lines (or blocks), it ends and returns nothing.

But what should it search for in order to create a title? Ben's idea is simple and, in some cases, very efficient: the code searches for an HTML header (h1, h2, h3, etc.). That's the role of the preg_match call.

If it does find a header, it strips from it 1) the link, if there is one, and 2) any other HTML tags. Keep in mind that this extractor was built to create titles for WordPress. In WordPress, the link in the title of a post is usually its permalink.

There's also a part in there that removes from the content the words that were found and used for the title, so the content of the post doesn't duplicate its title (maybe it's an aesthetic decision, or something related to SEO: I don't know).
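To make the walkthrough above concrete, here is a minimal sketch of the idea as I understand it. It is not Ben Ward's actual code: the function name `extractHeaderTitle`, the `$maxLines` parameter, and the sample post are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch of a header-based title extractor (not the Tumblr2Wordpress source).
// It looks for the first <h1>..<h6> within the first $maxLines lines of $content,
// returns its plain text, and removes that header from $content (passed by reference).
function extractHeaderTitle(&$content, $maxLines = 3) {
    $lines = explode("\n", $content);
    $limit = min($maxLines, count($lines));
    for ($i = 0; $i < $limit; $i++) {
        if (preg_match('#<h([1-6])\b[^>]*>(.*?)</h\1>#is', $lines[$i], $m)) {
            // strip_tags() drops the link and any other markup inside the header
            $title = trim(strip_tags($m[2]));
            // Remove the header so the body does not repeat the title
            $lines[$i] = str_replace($m[0], '', $lines[$i]);
            $content = trim(implode("\n", $lines));
            return $title;
        }
    }
    return ''; // no header found near the top of the post
}

$post = "<h2><a href=\"http://example.com/\">My title</a></h2>\n<p>Body text.</p>";
echo extractHeaderTitle($post), "\n"; // the header text, without the link markup
echo $post, "\n";                     // the body, with the header removed
```

The sketch returns an empty string when no header appears in the first few lines, which matches the "returns nothing" behavior described above.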

Later, when formatting the data retrieved from Tumblr's API, the script uses a statement like this to create the title:

htmlspecialchars is a PHP string function designed to convert special characters to HTML entities. Then we use formatEntryTitle to format $post_content (which is defined individually for every post type according to the structure of the Tumblr API), that is, to extract a title from the content of any post type.
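As a small illustration of what htmlspecialchars does on its own (the sample string `$raw` is made up):

```php
<?php
// htmlspecialchars() escapes characters that are special in HTML/XML.
// Note that tags are escaped, not removed.
$raw = 'Tom & Jerry <em>live</em>';
echo htmlspecialchars($raw), "\n";
// Tom &amp; Jerry &lt;em&gt;live&lt;/em&gt;
```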

So far so good. Except for two things: 1) This title extractor will work if and only if you made systematic use of HTML headers in your blog. Otherwise, the extractor won't find any h1, h2, h3, etc. and will return nothing. Basically, none of your posts will get a title after the export process is done. 2) I didn't use headers on my blog. But I set up a test blog with headers to see how it works, and I was never able to extract any title. Maybe it's me, the way I used headers in the body of the posts. I don't know. It doesn't matter to me: I don't want to fix Ben's script (I'm not even sure it's broken); I want to customize it instead.

(That's where Stack Overflow comes into play.)

It should be simple. As I said, I don't know much about PHP, but I'm already halfway there. My idea is the following: the extractor could take ALL the post content... and truncate it. That's it. Simply get the first few words of the post and use them as a title. I know it's not a perfect solution for everyone (titles may not always be relevant to the content, and they will partially duplicate the content of the post) but 1) at least it would work for those who do not use HTML headers; 2) in my case, it's great because the first few words of each of my posts are the content attribution: creator's name, name of the photo or book, year, etc.

I found this little piece of PHP code by Chirp Internet. It's a simple truncating function. It can be customized in many ways. Moreover: it works. Here's how I initially tried to use it.

I kept the name of the function, formatEntryTitle, but emptied it of its content and replaced it with Chirp's code:

Then, for each post type, I first define a string $title like so: $title = formatEntryTitle($post_content, 40, " "); (where "40" is the number of characters I want the post content truncated to and the blank space is the criterion for ending the truncation; in plain English: truncate-at-the-first-blank-space-after-40-characters) and then use the following statement to output the title itself: <title><?php echo htmlspecialchars($title) ?></title>
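For readers who want to see the behavior just described, here is a minimal stand-in. It is not Chirp Internet's code itself: the name `truncateAtBreak` and the `$pad` parameter are assumptions, and it uses byte-based string functions, so it is only a sketch for plain ASCII text.

```php
<?php
// Hypothetical stand-in for the truncating function described above.
// Cuts $text at the first occurrence of $break at or after position $limit.
function truncateAtBreak($text, $limit, $break = ' ', $pad = '...') {
    if (strlen($text) <= $limit) {
        return $text; // already short enough
    }
    $pos = strpos($text, $break, $limit);
    if ($pos === false) {
        return $text; // no break point after the limit
    }
    return substr($text, 0, $pos) . $pad;
}

echo truncateAtBreak('The quick brown fox jumps over the lazy dog', 10), "\n";
// The quick brown...
```

Because the cut happens at the first break after the limit, the result can be a little longer than the limit, and any HTML in the input is counted and kept like ordinary text.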

And it works. It really does.

Except for one point. And that's where I need some help: my titles are... full of HTML tags. I need to strip every HTML tag from the truncated text so I can get a plain, clear, English-only title.

I've tried to reuse Ben's code (his title extractor strips HTML tags from the title) but it's beyond what I'm capable of doing. I think it's a problem related to the hierarchical structure of the extractor: I don't know where to put the strip function.
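To show the kind of cleaning step in question, here is a rough sketch. It assumes the tags are removed before truncation rather than after (so that a tag is never cut in half), and the helper name `plainTextForTitle` and the sample caption are made up.

```php
<?php
// Hypothetical helper: reduce HTML to plain text before it is truncated.
function plainTextForTitle($html) {
    $text = strip_tags($html);                               // drop every tag, keep the inner text
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // &amp; becomes &, etc.
    $text = preg_replace('/\s+/', ' ', $text);               // collapse newlines and repeated spaces
    return trim($text);
}

$caption = '<p><a href="http://example.com/book">Some Book</a>, by <em>An Author</em> &amp; friends, 2010</p>';
echo plainTextForTitle($caption), "\n";
// Some Book, by An Author & friends, 2010
```

The cleaned string could then be handed to the truncating function, with htmlspecialchars applied only at output time so the entities are re-encoded once.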

I know it's a long post, but hopefully someone will see the solution in a flash.

Can you give some examples of the format of the posts and indicate what part of the post should serve as your title?
– prodigitalson, Oct 20 '10 at 20:39

Sure! The two types I use the most are Quote and Photo.
– Parneix, Oct 20 '10 at 21:30

Quote posts are constructed this way: first the quote itself. From the viewpoint of the Tumblr API, it is <quote-text>. Then the caption (where the quote comes from) and optional comments. In the API, this is <quote-source>. I want my title to come from the source. I can manage that in Ben's code. I already did. In general, my captions start with a link (to the book or magazine the quote comes from, for example), followed by the author's name, the year, etc. That's why I can create an adequate title from the first few words of the caption.
– Parneix, Oct 20 '10 at 21:34

The same goes for Photo posts. I'll use <photo-caption> from the Tumblr API. Ben Ward is already using this in his script. In both cases, captions can be short (name, place, year) but they can also be quite long if I've decided to add comments, talk about the context, or whatever. That's why I need the truncating function. The caption itself does not guarantee that the text will be short enough to act as a title. I'll be glad to give more examples or to explain myself further if needed.
– Parneix, Oct 20 '10 at 21:39