Blogarithms

Doug Kaye’s Weblog

6/29/2009

Email Gremlins

12:15 am

So I’ve been having this really strange problem. I use OS X’s Mail app along with SpamSieve for spam filtering. But recently I’ve been noticing that the spam detection has been hyperactive: way too many false positives. I tried re-training SpamSieve. No help. So then I shut it down altogether: Whoa! I was *still* getting messages sent to the spam folder. Next, all the usual steps: rebooting, re-initializing this and that. Still no help. With absolutely no spam filtering turned on, stuff was still being flagged and moved. (Any of you email geeks starting to get a clue here?)

For a totally separate reason I pulled out my MacBook Pro, and that’s when it hit me. I even caught the nasty gremlin in the act. What was it?

I use Google as my inbound and outbound email server. Yes, I use their spam filtering, too (it’s much better than SpamSieve), but that wasn’t it. Because I have three different email clients (if you count the iPhone) I use IMAP4 instead of POP3 to communicate between those clients and the Google server and keep things in sync. So here’s what was happening: My MacBook Pro had been on and running its own instances of Mail and SpamSieve. Messages would come into Google and, in some cases, my laptop would grab them. The copy of SpamSieve on that computer decided some of them were spam and would move them to the spam folder. And because I’m using IMAP4, this change was sent to the server and then to the email client running on the desktop. It was my laptop, running this other instance of my spam-filtering software, that was moving messages around on the email server and hence on my desktop client. It was downright spooky to see the messages moving without a clue as to why, but as soon as I realized my laptop was also running email, it became instantly clear.
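The effect is easier to see in a toy model. The sketch below is not real IMAP code: `Server`, `Client` and the keyword-based `looks_like_spam` filter are hypothetical stand-ins, and the only point is that folders live on the server, so a move made by one client shows up on every other client.

```python
class Server:
    """Toy stand-in for an IMAP server: folders live on the server, not on clients."""

    def __init__(self):
        self.folders = {"INBOX": [], "Spam": []}

    def deliver(self, message):
        self.folders["INBOX"].append(message)

    def move(self, message, source, target):
        self.folders[source].remove(message)
        self.folders[target].append(message)


class Client:
    """Toy mail client; an optional local filter mimics a client-side spam tool."""

    def __init__(self, name, server, spam_filter=None):
        self.name = name
        self.server = server
        self.spam_filter = spam_filter

    def check_mail(self):
        # Iterate over a copy because move() mutates the inbox list.
        for message in list(self.server.folders["INBOX"]):
            if self.spam_filter and self.spam_filter(message):
                self.server.move(message, "INBOX", "Spam")

    def view(self, folder):
        # Every client sees the same server-side folder contents.
        return list(self.server.folders[folder])


def looks_like_spam(message):
    # Hypothetical, deliberately over-eager filter.
    return "offer" in message.lower()


server = Server()
desktop = Client("desktop", server)                 # filtering switched off
laptop = Client("laptop", server, looks_like_spam)  # forgotten, still filtering

server.deliver("Special offer from a real friend")
server.deliver("Meeting notes")
laptop.check_mail()

print(desktop.view("INBOX"))  # ['Meeting notes']
print(desktop.view("Spam"))   # ['Special offer from a real friend']
```

The desktop client never ran a filter, yet its view of the spam folder changed, which is the "gremlin" described above.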

6/27/2009

The Quarter-Million Milestone

9:21 pm

A few hours ago SpokenWord.org passed another significant milestone: 250,000 audio and video programs in the database, all submitted by our 2,287 members and 5,028 RSS/Atom feeds.

6/26/2009

Adventures in Full-Text Search

8:06 pm

SpokenWord.org calls itself a site for “finding and sharing audio and video spoken-word recordings.” Sounds great, but our “finding” capabilities (search, in particular) have been pretty bad. In mid-March I started writing a fancy new full-text search module that worked across database tables and allowed all sorts of customization and advanced-search features. Six weeks and a few thousand lines of code later, I had a new system that…well, sucked. There are all sorts of reasons why, but it sucked. Bottom line: It just didn’t do a decent job of finding stuff.

I then considered implementing something like Solr, based on Lucene. But the more I thought about it, the more I realized that would be only marginally better.

Searching for audio and video programs from a database that will hit 250,000 in the next few hours comes down to a few architectural issues:

  • You’ve got to search the text of titles, descriptions, keywords, tags and comments, which in our case are stored in separate database tables.
  • There are three ways of doing this: (1) read the database tables in which these strings are stored in real time; (2) in background/batch, build a separate table of the integrated text from the separate tables, then search this integrated table in real time; or (3) build the integrated table by scraping/crawling the site’s HTML pages then, as in #2, search that table in real time.
  • Make your search smart by ignoring noise words, tolerating (or correcting) spelling mistakes, understanding synonyms, etc.
  • Develop a ranking algorithm to display the most-relevant results first.
  • Provide users advanced-search options such as boolean logic and restricting the search to a subset of objects such as only searching programs or only searching feeds.
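To make option (2) concrete, here is a minimal sketch of the idea: a batch step merges the text from several tables into one index, and a real-time step searches only that index. The sample data, the noise-word list and the count-based ranking are all assumptions for illustration; this is not the SpokenWord.org code, and it is not how Solr is implemented.

```python
import re
from collections import defaultdict

# Hypothetical rows standing in for the separate database tables.
TITLES = {1: "Phil Windley on digital identity", 2: "Open video and the web"}
TAGS = {1: ["identity", "security"], 2: ["video", "codecs"]}
COMMENTS = {1: ["Great interview"], 2: ["More about video formats please"]}

NOISE_WORDS = {"a", "and", "on", "the", "of", "about"}


def tokenize(text):
    """Lower-case the text, split on non-alphanumerics and drop noise words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in NOISE_WORDS]


def build_index():
    """Batch step: merge all text fields per program into one inverted index."""
    index = defaultdict(lambda: defaultdict(int))  # word -> program id -> count
    for program_id, title in TITLES.items():
        fields = [title] + TAGS.get(program_id, []) + COMMENTS.get(program_id, [])
        for word in tokenize(" ".join(fields)):
            index[word][program_id] += 1
    return index


def search(index, query):
    """Real-time step: rank programs by how often the query words occur."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for program_id, count in index.get(word, {}).items():
            scores[program_id] += count
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


index = build_index()
print(search(index, "the video"))       # [2]
print(search(index, "identity video"))  # [2, 1]
```

Note what the merge throws away: once titles, tags and comments are concatenated, the index no longer knows which field a word came from. That is the same loss of structure the scraping approach (3) suffers from.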

My fancy search code used method #1 and the resulting code generated some of the longest, most confusing and slowest SQL queries I’ve ever seen. And it’s buggy. Solr uses technique #2, and that’s clearly better for all sorts of reasons. #3 seemed like a particularly poor solution because (a) you lose track of the differences between titles and tags, for example, and (b) it’s kludgy. Or so I thought.

But I’ve now implemented technique #3 by outsourcing the whole thing to Google Custom Search and the initial results are spectacular. Here’s why:

  • Scraping HTML may sound kludgy, but it works.
  • Google knows how to scrape web pages better than anyone.
  • So long as you’re keeping the text you want searched in the page itself (e.g., not loaded separately via Ajax), Google will find it.
  • Google’s smart-search, advanced-search and relevance-ranking are better than anything you can write or find elsewhere.
  • Google does all of this with their CPU cycles, not ours, thereby eventually saving us an entire server and its management.
  • Google allows educational institutions and non-profit organizations to disable ads.
  • Google does a better job of finding what you want than is possible using an in-house full-text search with lots of customized filtering options.

This last one is important. I spent a lot of time on giving users tools for narrowing their search. For example, I provided radio buttons to distinguish between programs, feeds and collections. But it annoyed even me that users had to check one of these buttons. People would search for “IT Conversations” and find nothing because the default was to search for individual programs not feeds and there are no individual programs with that string in their titles or descriptions. Annoying and confusing.

Then I had a moment of clarity. Rather than proactively providing users control of the object type up front, I came up with another scheme. I changed the HTML <title>s of the pages so that they now start with strings like Audio:, Video:, Feed: and Collection:. This way (once Google re-scrapes all quarter-million pages) the search results will allow you to immediately and clearly distinguish programs (by media type) from RSS/Atom feeds and personal collections. I’ve tried it on my development server and it’s great. Because of the value of serendipity and the fact that Google’s search is so good, I find it’s much more valuable to discover objects in this way than to specify a subset of the results in advance.
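A sketch of the title-prefix idea, assuming a simple lookup from object type to prefix. The type keys and the `page_title` helper are hypothetical; only the four prefixes come from the description above.

```python
# Hypothetical mapping from internal object type to the visible title prefix.
TITLE_PREFIXES = {
    "audio": "Audio",
    "video": "Video",
    "feed": "Feed",
    "collection": "Collection",
}


def page_title(object_type, name):
    """Return the text for the HTML <title> element, e.g. 'Feed: IT Conversations'."""
    prefix = TITLE_PREFIXES.get(object_type)
    # Unknown types fall back to the bare name instead of raising.
    return f"{prefix}: {name}" if prefix else name


print(page_title("feed", "IT Conversations"))  # Feed: IT Conversations
```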

Finally, I’ve discovered that Custom Search supports a feature from regular Google search. You can specify part of a URL as a filter. For example, if you want to search only for feeds, you can start your search string with “http://spokenword.org/feed”. The result will only include our feeds. Same for /collections, /members and /programs. How cool is that? (Thank goodness for RESTful URLs!) I have yet to integrate that into the web site — a weekend project — but it means we can offer the user the ability to restrict the search to a particular type of object if that’s what they want.
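Integrating this could be as small as prepending the URL fragment to whatever the user typed. The sketch below assumes the four path segments mentioned above are the complete set and that the search box sends its text in a parameter named `q`; both are assumptions, not details of the Custom Search API.

```python
from urllib.parse import urlencode

SITE = "http://spokenword.org"
# Path segments named in the post; treating them as a fixed set is an assumption.
OBJECT_PATHS = {"feed", "collections", "members", "programs"}


def restricted_query(terms, object_type=None):
    """Prefix the search string with a site URL fragment to narrow the results."""
    if object_type is None:
        return terms
    if object_type not in OBJECT_PATHS:
        raise ValueError(f"unknown object type: {object_type}")
    return f"{SITE}/{object_type} {terms}"


query = restricted_query("IT Conversations", "feed")
print(query)                    # http://spokenword.org/feed IT Conversations
print(urlencode({"q": query}))  # q=http%3A%2F%2Fspokenword.org%2Ffeed+IT+Conversations
```

Leaving `object_type` as `None` keeps the unrestricted search as the default, which matches the preference for serendipity described earlier.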

I’m so glad that Google Custom Search works as well as it does that I’ve decided not to brood about the six weeks of my life wasted designing, coding and debugging my own search. It was another one of those learning experiences.

Note: Not all of the features described above appear on SpokenWord.org yet, and the maximum benefit won’t be visible until Google re-scrapes the site, but if you use the Search box on the top of the right-hand column you’ll get the idea. Very cool.

6/24/2009

The Submission Wizard

4:24 pm

Making it easier to submit content to SpokenWord.org has always been high on the to-do list. For the past seven weeks I’ve been working on a Submission Wizard, which I hope goes a long way towards that goal. It’s a wizard because it takes what you give it and tries to figure out what you meant. If you supply the URL of a media file, it will then ask you for an associated web page from which it will suggest the title, description and keywords. If you start by supplying a web-page URL, the wizard will scrape that page looking for RSS/Atom and OPML feeds. And whether it finds those feeds or you explicitly supply a feed’s URL, the wizard will give you choices of what to submit and what to add to your collection(s) before showing you all the steps it takes to follow your instructions.
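For the curious, here is roughly what the feed-finding step can look like. This is a minimal sketch using only Python’s standard library and the common `<link rel="alternate">` autodiscovery convention; it is not the wizard’s actual code, it ignores OPML, and the sample page and `example.com` URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        if "alternate" in rel and attributes.get("type", "").lower() in FEED_TYPES:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attributes.get("href", "")))


def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


SAMPLE = """
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="/feed.xml">
<link rel="alternate" type="application/atom+xml" href="http://example.com/atom">
</head><body></body></html>
"""
print(find_feeds(SAMPLE, "http://example.com/shows/"))
# ['http://example.com/feed.xml', 'http://example.com/atom']
```

Pages that only link to their feeds from ordinary `<a>` tags would need extra handling beyond this sketch.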

After the RSS/Atom feed parser, which continues to be a maintenance challenge, the Submission Wizard is probably the most-complex single piece of code for the site. It weighs in at about 6,000 lines of new code and it’s certainly not done. Give it a try, and if it doesn’t do what you think it should, let me know. I’m particularly interested in finding more web pages that the Submission Wizard can learn how to scrape.

6/22/2009

Trying to Crack YouTube Videos

9:54 pm

Anyone out there have an idea how to solve this?

Over at SpokenWord.org we’re trying to figure out how to scrape YouTube pages (or pages with embedded YouTube players), then hack a video or ShockWave URL that we can include in the <enclosure> element of RSS feeds. We’ve been able to do this for programs in YouTube EDU such as this page (http://www.youtube.com/watch?v=Y1XpTc1-lh0), which we convert to this media-file URL (http://www.youtube.com/v/Y1XpTc1-lh0&f=user_uploads&app=youtube_gdata). The latter URL can be played by standard Flash players, so we can include it in RSS feeds. But this only works for certain special cases such as YouTube EDU, not for mainstream YouTube pages.
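The conversion for the YouTube EDU case is mechanical, as this sketch shows. The function name is hypothetical, and the trailing parameters are copied verbatim from the example above without any claim about what they mean or whether they work elsewhere.

```python
from urllib.parse import parse_qs, urlparse


def watch_to_media_url(watch_url):
    """Convert a YouTube watch-page URL to the /v/ form quoted in the post."""
    parsed = urlparse(watch_url)
    video_ids = parse_qs(parsed.query).get("v")
    if not video_ids:
        raise ValueError(f"no video id in {watch_url}")
    # The trailing parameters are copied verbatim from the example above.
    return f"http://www.youtube.com/v/{video_ids[0]}&f=user_uploads&app=youtube_gdata"


print(watch_to_media_url("http://www.youtube.com/watch?v=Y1XpTc1-lh0"))
# http://www.youtube.com/v/Y1XpTc1-lh0&f=user_uploads&app=youtube_gdata
```

The hard part is not this rewrite but the fact that the resulting URL only plays for certain categories of video.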

6/15/2009

I’m a TWiT Again

12:11 am

Had a lot of fun Sunday. Drove up to the TWiT Cottage in Petaluma to be on Leo Laporte’s This Week in Tech (TWiT) episode 199. (Wow, the last time I was on was over a year ago!) Leo and the chat room seemed to think it was a pretty good show. The big treat for me was to be able to meet Wil Harris who was also in the studio instead of his usual participation via Skype. Leo is a real pro, and it’s always an honor to be invited to join the show.

6/5/2009

Happy Birthday to You

10:52 am

Today is the 6th anniversary of the first IT Conversations program, which pre-dated podcasting by about 15 months. And who was our second guest? None other than Phil Windley, who is now Executive Producer of the channel. What you may not realize is that Phil has actually presided over IT Conversations for longer than I did (he began his stint in April 2006) and has certainly published the majority of the channel’s 1,895 programs to date.

Behind Phil is TeamITC: our worldwide gang of 40 audio engineers, website editors and series producers headed by Paul Figgiani (audio) and Joel Tscherne (producers) who do all the heavy lifting to bring you new high-quality programs every day. The same team also brings you Social Innovation Conversations, in collaboration with the Center for Social Innovation at the Stanford Graduate School of Business (Bernadette Clavier, Executive Producer) and the soon-to-be-launched CHI Conversations (Steve Williams, EP).

We’re so used to doing this day in and day out that it’s easy to forget our own history. For example, as Ian Forrester pointed out this morning, we were one of the first (perhaps *the* first) to publish conferences online for free. It began with our live audio streams from the O’Reilly Digital Democracy Teach-In and Emerging Technology Conference in February 2004.

Here’s a special thanks to everyone on the team including those who have helped us and moved on to other endeavors. Approximately 145 people have been members of TeamITC at one time or another. And thanks to all our listeners, fans and particularly donors and supporters who help us pay the bills.

Happy Birthday to You.

5/29/2009

Would You Slap Your Father?

1:06 pm

Nicholas Kristof wrote about an interesting survey that correlates morals, disgust and other feelings with those who identify themselves as being liberal or conservative. There are lots of interesting and fun surveys on such topics at YourMorals.org. Looks like I’m more liberal than most liberals. (My responses in green.)

SlideShare.net Offers MP3 Hosting

10:39 am

Our friends over at SlideShare.net have just announced a much-requested feature: unlimited hosting of MP3 files used to make multimedia slidecasts. Previously you had to find a separate place for your audio file, then synchronize it with your slides hosted on SlideShare.net. Now you can host everything in one place. This should make it easier for anyone to publish a slideshow-based audio presentation.

5/18/2009

Airplane WiFi — Isn’t This Old News?

11:16 pm

Everybody (well, Joe Sharkey in the NY Times and lots of bloggers and Twitterers) is foaming at the mouth about the trials of WiFi in commercial airliners. I must be missing something here. I flew on Lufthansa from San Francisco to Munich three years ago with a terrific WiFi connection. I believe they discontinued the service because of the cost, but there certainly didn’t seem to be any major technical glitches. True, I did expect in-air WiFi would be more commonplace by now, but I think the press and blog coverage is missing an important fact: this isn’t something new.

5/16/2009

Terms of Service

10:53 pm

I guess it had to happen sooner or later. Someone submitted a hard-core porn video feed to SpokenWord.org. (No, don’t go looking for it!) Maybe we’ve just been lucky thanks to keeping a fairly low profile. We do accept RSS feeds with content tagged as ‘explicit’ but there’s explicit (perhaps just audio with adult language) and then there’s really explicit. I’m thinking of dealing with it in a few ways.

  • No content tagged as ‘explicit’ on the home page.
  • Content tagged as ‘explicit’ is invisible to those who have not opted-in via their profiles.
  • In order to opt-in, you must read and agree to the Terms of Service *and* you must claim to be 18 years old or older.
  • You can register without explicitly agreeing to the Terms of Service, but the first time you submit content, you’ll have to agree explicitly. (Sorry for the re-use of the word ‘explicit’.)
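Written down as code, the proposed rules are short. This is only a sketch of the proposal as listed: the `Member` fields and both function names are hypothetical, since nothing here defines a profile schema.

```python
from dataclasses import dataclass


@dataclass
class Member:
    # Hypothetical profile fields; the post does not define a schema.
    accepted_terms: bool = False
    claims_adult: bool = False
    wants_explicit: bool = False

    def opted_in(self):
        """Opting in requires the Terms of Service *and* the age claim."""
        return self.wants_explicit and self.accepted_terms and self.claims_adult


def is_visible(explicit, on_home_page, member=None):
    """Apply the proposed rules to one piece of content."""
    if not explicit:
        return True
    if on_home_page:
        return False  # never on the home page, even for opted-in members
    return member is not None and member.opted_in()


def may_submit(member):
    """Submitting content requires explicit agreement to the Terms of Service."""
    return member.accepted_terms


adult = Member(accepted_terms=True, claims_adult=True, wants_explicit=True)
print(is_visible(True, False, adult))     # True
print(is_visible(True, True, adult))      # False
print(is_visible(True, False, Member()))  # False
```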

I’ve got no experience with this, and I welcome suggestions. In particular, if you can recommend another site that handles occasional explicit content well and/or has good Terms of Service, let me know.

5/15/2009

Ratings Now in SpokenWord.org RSS Feeds

11:31 am

All RSS feeds generated by SpokenWord.org now include program rating data according to the conversationsNetwork namespace. If a program has been rated, the <item> for that program now includes the ratingAverage and ratingCount elements. If the feed was requested with authentication (identifying a particular SpokenWord.org member) it will also include the ratingIndividual element. The ratingTimestamp element has not yet been implemented. I’m still trying to figure out if it’s worthwhile and whether it should reflect the last rating by the authenticated individual or by anyone.
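A feed consumer could read the new elements along these lines. The namespace URI, the `cn` prefix, the sample feed and the numeric formats below are placeholders I made up for illustration; substitute the real conversationsNetwork namespace URI before using anything like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and prefix; substitute the real conversationsNetwork one.
CN_NS = "http://example.org/conversationsNetwork"

SAMPLE_FEED = f"""<rss version="2.0" xmlns:cn="{CN_NS}">
  <channel>
    <item>
      <title>Rated program</title>
      <cn:ratingAverage>4.5</cn:ratingAverage>
      <cn:ratingCount>12</cn:ratingCount>
    </item>
    <item>
      <title>Unrated program</title>
    </item>
  </channel>
</rss>"""


def read_ratings(feed_xml):
    """Return {title: (average, count)}; unrated items map to None."""
    ratings = {}
    for item in ET.fromstring(feed_xml).iter("item"):
        average = item.findtext(f"{{{CN_NS}}}ratingAverage")
        count = item.findtext(f"{{{CN_NS}}}ratingCount")
        rated = average is not None and count is not None
        ratings[item.findtext("title")] = (float(average), int(count)) if rated else None
    return ratings


print(read_ratings(SAMPLE_FEED))
# {'Rated program': (4.5, 12), 'Unrated program': None}
```

Because the rating elements appear only on rated programs, the reader has to treat their absence as "no rating yet" rather than as an error.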

5/13/2009

Open Video Conference

11:11 pm

I’ll be attending the first day of the Open Video Conference at NYU, June 19-20. Looks like an interesting event.

5/12/2009

Users Tell Us What’s Wrong

12:58 pm

You can always count on loyal website visitors to tell you what you really need to hear. I emailed a survey to the registered members of SpokenWord.org this morning and already have some great responses. Here’s a sample of the answers to “What do you NOT like about SpokenWord.org?”

  • collections
  • No Comment
  • Not sure I totally understand how it works.
  • I do not understand how I am supposed to use SW, and the web pages don’t make it manifest. What is “collect” and what does “subscribe” imply? Unclear. Yes, I can spend a lot of time clicking “help” to “FAQ” to “Advanced: Collection” but I still feel that I’m using a Swiss Army Knife as a club. Awkward and unfamiliar.
  • - the search is annoying; I wish it would search both feeds and episodes, without having to go to a separate page. - the look is quite cluttered and visually messy - it’s not good for discovery; I don’t find the homepage content useful — it’s rarely something that I want to listen to, and doesn’t change frequently enough. I haven’t looked at other’s collections much, maybe that would be helpful.
  • Jason Ponten
  • Too early to tell
  • Too much data entry required to add a single program. The feed reader should be more forgiving. I’ve tried to add several feeds that failed, but I assume they work find with iTunes or other feed readers.
  • Removing individual programs from collection (that was added trough feed) was not working, but now that seems to be fixed.
  • Nothing
  • I’m not so sure about the Stack Overflow-type badges and such, though I’m always a late adopter in the social media thing.
  • I used to get these emails from ITC, and I just manually downloaded each one and put it in a directory. Now, I don’t know where to find that stuff. there seems so much, it’s confusing to a simpleton like myself.
  • Same as above
  • not friendly for new users. not clear what should you do there..
  • I am not sure how to use it.
  • - No audio podcasts of video talks available - The time lag in updating my feed after making changes to my collection. - The strictness in parsing RSS feeds has not allowed me to move all of my podcasts over to Spoken Word.
  • to much mumbo jumbo and does not seem smooth
  • It was difficult to figure out at first.
  • Wasn’t obvious how to subscribe to a feed, although I just went back and found it.
  • Not live. New feeds sometimes take days to add programs to my collections. When a new feed is added you should give the ability to add a small number of older programs immediately to test the feed.
  • I just didn’t find a lot of podcasts that I hadn’t already found. It’s been a while since I checked. I’ll look again and maybe this opinion will change.
  • Cluttered UI.
  • Some of the feed parsing is pickier than I thought was necessary. If I make a collection on SpokenWord and subscribe to its feed on my PC its not always easy to differentiate which original feed an episode is from. So I don’t use this feature. it’s not really your fault but there isn’t an easy way to contribute to SpokenWord other than adding feeds. I download podcasts with gpodder on my PC and I don’t tag or rate podcasts because it’s a lot of extra effort.
  • can’t get to it ALL!
  • Off the top of my head? Nothing.
  • Too hard to find things. I am pretty new at this. I also visited TED. I felt it was easier to find interesting stuff there.
  • Can’t think of anything…
  • Well, for one thing, the feedback link to tell you what didn’t work didn’t work. And here’s an obvious but unappreciated idea: I signed up to hear a progran that wasn’t there. Can you build some kind of machine that’ll delete busted or cancelled links? Also, I had a little too much difficulty finding the actual link to the program. In fact, more than a little too much.
  • Search is so broken! I search for my own podcast and it doesn’t show up in search results - even though I’ve submitted it. I have to type the exact url. Related keywords are useless. Also, I would really, really encourage you to create multiple lists of podcasts broken down by various categories, topics, niches, sub-sub niches, brand-new podcasts, etc. Even if these lists are in a separate section of the site (not taking up valuable home page real estate) these would be invaluable to finding/discovering podcasts I haven’t heard about previously.
  • there is an empty yellow popup area on the home page that just says “close window”
  • The layout and searching for new podcasts. Not much Canadian Content.

What do you think? Maybe we have a UI problem?

5/11/2009

Kindle Needs Profiles

2:36 pm

Here’s a feature I’d like to see in future versions of Amazon’s Kindle software: user profiles. My wife and I would like to share the same Kindle, she with her books, me with mine and both of us sharing the NY Times, the New Yorker, etc. The best way to do that would be for us each to have separate bookmarks, so that I could just pick up the Kindle, switch to my bookmarks and pick up in the document and on the page I was last reading.

5/8/2009

One Million and Counting

12:27 am

Bernadette Clavier, Executive Producer of our Social Innovation Conversations channel, says we just passed the one-million audio download milestone over there. Congratulations to Bernadette, Leah Silverman and the rest of our SIC producers, writers and engineers.

5/5/2009

Bug- and Feature-Tracking Report Online

1:17 pm

By popular demand, I’ve posted a report from Jira, our issue-tracking system, for all to download. Many (most?) of the items won’t make sense to anyone who’s not actually writing code for SpokenWord.org, but you may find it interesting. By all means, feel free to peruse it and if you find a feature request that you think should be escalated (or one that should be added), let us know. The best place to discuss bugs and new features is on our Spoken Word Strategy discussion list.

What Would it Take?

1:15 pm

Now that SpokenWord.org is out of the alpha/beta phase…

  • What would it take for you to:
    • recommend SpokenWord.org to your friends?
    • send a link to your collections to your friends?
  • How should we spread the word?
    • a JavaScript or Flash widget for displaying your collections on other web sites?
    • a better email-to-a-friend feature?
    • better links to social networking sites?
    • direct interfaces to Facebook, FriendFeed, etc?

Give us your ideas on our Discussion List.

5/3/2009

SpokenWord.org UI Improvements

12:09 pm

I’ve just enabled some much-requested UI improvements to SpokenWord.org. Over time we’ve been displaying more and more metadata on the detail pages for members, programs, feeds and collections and you’ve told us they’re just too cluttered and hard to navigate. As of early this morning we’ve moved most of the metadata and actionable links on those pages to the right column. I hope the data are easier to find and the pages are now easier to read.

4/30/2009

Swine Flu: Why All the Hype?

10:36 pm

As I wrote on Twitter earlier today: “35,000-50,000 die each year from influenza [and subsequent bacterial pneumonia] in the U.S. The panic over this swine flu is whacko. Is it just a slow news week?”

Okay, so I can understand the media frenzy — we’re all used to it — but what about the Obama administration? Janet Napolitano (Homeland Security) and CDC officials aren’t just accepting interviews, they’re pushing this. When’s the last time a U.S. president opened a major press conference suggesting we wash our hands and stay home from school or work? It’s really quite extraordinary. And puzzling.

And then it hit me. Of course! The Obama administration is scared to death that this could be their Katrina. They know the chances of the H1N1 flu becoming a true national tragedy are quite slim, but they’d rather risk the consequences of overreacting than take the chance they’d be blamed for an inadequate response. For them it’s not about the likelihood the swine flu will become something we should worry about. Instead, it’s about *their* risk of being associated with the previous administration’s national-response failures. Once I figured that out, it all made sense.

I don’t want to give the impression that this isn’t a potentially serious situation. I believe it is, particularly because the virus has essentially spread worldwide and development of a vaccine has just begun. I hear it could take 9-12 months before large quantities can be available and by then 25%-30% of the population could have already been infected. And we also don’t know how virulent this microbe really is. Only one fatality yet in the U.S., and that was a young child who came here from Mexico. 160+ deaths in Mexico, but we don’t yet have a clue of the actual death rate. How many were infected and recovered? If the virus is already widespread, 160 deaths could actually suggest a milder-than-usual outbreak.

Bottom line: It’s early. Let’s all hope the pandemic (already declared by the WHO) turns out to blow over quickly. But in the meantime, there’s certainly no shortage of attention being given to it, from the White House on down.
