SlideShare today announced the biggest change since we started. We are now rendering presentations and documents using HTML5 instead of Flash. This is a milestone. Five years ago, it was impossible to build something like SlideShare or YouTube without Flash. But the web has finally caught up.
This project was the biggest engineering project in SlideShare’s history. Much of the SlideShare engineering team has been working on this around the clock for the last six months. As we have learnt over the past five years, people are picky about how their presentations look. Getting the fonts and the text placement to look exactly right across all supported browsers was a real engineering challenge. So we’re happy to finally be able to see this on SlideShare.net.
Ditching Flash for HTML5 feels like the right choice for us for a number of engineering reasons.
- The exact same HTML5 documents work on the iPhone / iPad, Android phones/tablets, and modern desktop browsers. This is great from an operations perspective: it saves us extra storage costs and maximizes the cache hit ratio on our CDN (since a desktop request fills the cache for a mobile request, and vice versa). It’s also great from a software engineering perspective, because we can put all our energy into supporting one format and making it really great.
- Documents load 30% faster and are 40% smaller. ‘Nuff said on that front, faster is ALWAYS better.
- The documents are semantic and accessible. Google can parse and index the documents, and so can any other bot, scraper, spider, or screen reader. This means that you can write code that does interesting things with the text on SlideShare pages. You can even copy and paste text from a SlideShare document, something that was always a pain with Flash.
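As a small illustration of the last point, here is a minimal Python sketch that pulls the visible text out of an HTML slide using only the standard library. The sample markup, the class name `SlideTextExtractor` and the function `extract_text` are made up for this example; they are not SlideShare's actual markup or API.

```python
from html.parser import HTMLParser


class SlideTextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML fragment as one string."""
    parser = SlideTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical slide markup, invented for this example.
SAMPLE_SLIDE = """
<div class="slide">
  <style>.slide { color: black; }</style>
  <h1>Why HTML5?</h1>
  <p>One format for <b>every</b> device.</p>
</div>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_SLIDE))
```

Nothing comparable was practical when the text lived inside a Flash movie.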
What were the most challenging parts of this project? Glad you asked.
Font Conversion
Font handling was the biggest challenge. We had to build support for rendering arbitrary fonts in your browser, including fonts that are not installed on the client. If you invent a new font and upload a PDF that uses it, it should still render perfectly on SlideShare. Whoa!
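This post does not go into how the converted fonts reach the browser. The standard browser feature for using a font that is not installed on the client is the CSS `@font-face` rule, so here is a rough Python sketch, under that assumption, of generating such a rule for an extracted font. The function name `font_face_rule`, the family name and the URL are all hypothetical.

```python
def font_face_rule(family, url, font_format="woff"):
    """Build a CSS @font-face rule for one downloadable font file.

    `family` is the name the page's CSS will refer to, `url` is where the
    converted font file is hosted, and `font_format` is the format hint.
    All three are illustrative inputs, not values from the real system.
    """
    # Escape backslashes and double quotes so the family name stays a valid CSS string.
    safe_family = family.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "@font-face {\n"
        f'  font-family: "{safe_family}";\n'
        f'  src: url("{url}") format("{font_format}");\n'
        "}"
    )


if __name__ == "__main__":
    # Hypothetical per-document font name and location.
    print(font_face_rule("doc123-font1", "https://cdn.example.com/fonts/doc123-font1.woff"))
```

Giving each extracted font a name that is unique to its document is one way to keep two uploads with different fonts of the same name from clashing.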
Text Placement
Placing the text is very tricky because of differences between browsers, differences between fonts (ligature handling, for example), and several other complexities. To illustrate: the PDF coordinate system starts in the bottom left, while HTML starts in the top left. PDFs use points; in HTML you get your choice of unit, but no two browsers agree on how precise any particular unit is! The largest problem we face with placement is normalization. We spent a lot of time finding the magic combination of ems, percentages and zoom that gives us correct placement across the web.
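To make the coordinate problem concrete, here is a minimal Python sketch of the basic conversion: flip the y axis, then express the box either in CSS pixels or as percentages of the page. It is an illustration of the idea only, not our production code. The names `PdfBox`, `pdf_to_css_px` and `pdf_to_percent` are invented for the example, and it assumes the box is given by its bottom-left corner in unrotated PDF points.

```python
from dataclasses import dataclass

# CSS defines 1in = 96px = 72pt, so one point is 96/72 CSS pixels.
CSS_PX_PER_PT = 96.0 / 72.0


@dataclass(frozen=True)
class PdfBox:
    """A text box in PDF space: origin at the bottom left, units are points."""
    x: float       # left edge
    y: float       # bottom edge
    width: float
    height: float


def pdf_to_css_px(box, page_height_pt):
    """Return (left, top, width, height) in CSS pixels, origin at the top left."""
    top_pt = page_height_pt - (box.y + box.height)  # flip the y axis
    return tuple(v * CSS_PX_PER_PT for v in (box.x, top_pt, box.width, box.height))


def pdf_to_percent(box, page_width_pt, page_height_pt):
    """Return (left, top, width, height) as percentages of the page size."""
    top_pt = page_height_pt - (box.y + box.height)  # flip the y axis
    return (
        100.0 * box.x / page_width_pt,
        100.0 * top_pt / page_height_pt,
        100.0 * box.width / page_width_pt,
        100.0 * box.height / page_height_pt,
    )


if __name__ == "__main__":
    box = PdfBox(x=60, y=640, width=120, height=80)
    print(pdf_to_css_px(box, page_height_pt=800))
    print(pdf_to_percent(box, page_width_pt=600, page_height_pt=800))
```

Percentages have the nice property that the layout scales with the size of the player, which is one reason relative units are attractive here. The arithmetic is the easy part; the hard part is that browsers round these values differently.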
Error Handling
We also built a system that measures the variance between an image of the HTML output and an image generated directly from the document. If there’s more than a certain amount of variance, we consider that an error and we won’t serve that page as HTML5. Instead we’ll serve a PNG image of the page when that page is requested. There was some hard-core computer vision involved in the error-handling system. The way we look at it, we want to serve HTML5, but not at the expense of a document that looks bad and disappoints the author.
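The real comparison is more involved than we can show here, but the decision itself can be sketched in a few lines of Python. This is a deliberately naive stand-in (a per-pixel grayscale difference, not the computer vision mentioned above), and the function names and both threshold values are assumptions chosen for the example.

```python
def mismatch_ratio(rendered, reference, pixel_tolerance=16):
    """Fraction of pixels whose grayscale values differ by more than pixel_tolerance.

    Both images are lists of rows of 0-255 grayscale integers.
    """
    if len(rendered) != len(reference) or any(
        len(a) != len(b) for a, b in zip(rendered, reference)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in reference)
    if total == 0:
        raise ValueError("images must not be empty")
    differing = sum(
        1
        for row_a, row_b in zip(rendered, reference)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > pixel_tolerance
    )
    return differing / total


def choose_format(rendered, reference, max_mismatch=0.02):
    """Serve HTML5 when the two images are close enough, otherwise fall back to PNG."""
    return "html5" if mismatch_ratio(rendered, reference) <= max_mismatch else "png"


if __name__ == "__main__":
    reference = [[0, 255], [255, 0]]
    slightly_off = [[10, 250], [255, 0]]   # small anti-aliasing style differences
    badly_off = [[0, 255], [0, 0]]         # one pixel in four is wrong
    print(choose_format(slightly_off, reference))
    print(choose_format(badly_off, reference))
```

The per-pixel tolerance is there so that small rendering differences such as anti-aliasing do not count as errors; only differences a reader would notice should push a page to the PNG fallback.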
Cloud Computing
Our conversion stack runs on Amazon EC2 and is configured and managed by Puppet. We’ve been using EC2 for our conversion stack for years, so we’re old hands at that stuff. For this new system, we started out with a number of different types of servers (a font extractor, a font generator, etc). What we found out is that the coordination time between different machines (using Amazon SQS) and the IO time (using S3) were a huge bottleneck. So our architecture for this new system is more reminiscent of the Netflix “Rambo” architecture. Each box is a self-contained system that can do the entire job of conversion, with no help from anyone.
As we speak, an army of hundreds of Amazon EC2 instances is crunching away at converting the *millions* and *millions* of presentations and documents that have been uploaded to SlideShare over the last 5 years to HTML5. New documents will automatically be converted to HTML5 from now on. We hope to have the transition complete by the end of the year (maybe sooner, but no promises!). At that point all SlideShare content will be served as HTML5.
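In code terms, the difference looks roughly like this. In the Python sketch below, every stage runs in the same process on the same box, passing its result straight to the next stage, with no queue message or storage round trip in between. The stage bodies are placeholders, and every name other than the font extractor and font generator roles mentioned above is hypothetical.

```python
def extract_fonts(job):
    # Placeholder: a real stage would pull the embedded fonts out of the upload.
    return {**job, "fonts": ["embedded-font-1"]}


def generate_web_fonts(job):
    # Placeholder: a real stage would convert each font for use in the browser.
    return {**job, "web_fonts": [f"{name}.woff" for name in job["fonts"]]}


def render_html(job):
    # Placeholder: a real stage would emit the positioned HTML for each page.
    return {**job, "html": "<div>placeholder</div>"}


# All stages live on every box, so one machine can finish a whole job alone.
STAGES = [extract_fonts, generate_web_fonts, render_html]


def convert_on_one_box(document_id):
    """Run every conversion stage in-process and return the finished job."""
    job = {"document_id": document_id, "completed": []}
    for stage in STAGES:
        job = stage(job)                        # direct function call, no hand-off
        job["completed"].append(stage.__name__)
    return job


if __name__ == "__main__":
    result = convert_on_one_box("doc-1")
    print(result["completed"])
```

With the earlier design, each arrow between stages was a message on a queue plus a write and a read against remote storage. Here it is a function call, and scaling out just means starting more identical boxes.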
Next Steps
This is a work in progress. We are betting the company on HTML5, and are going to continue to invest in the HTML5 conversion stack and JavaScript player technologies that we’re releasing today. Some of the next things on our plate include:
- Handling z-indexes (objects occluding other objects) better
- Continued development on our font extraction technology
- Adding some features that we just weren’t able to port to our HTML5 player in time for this launch, like embedded video and synchronized audio.
Source Sho Tools