Semantically Tuning a Web Site
From bemoko developer wiki
Introduction
A web site can essentially be considered a web service, albeit one that is typically accessed by humans using an internet browser. However, the format of a web page can vary hugely, with content interwoven with stylistic instructions, intricate client-side logic, and even non-compliant markup. Internet browsers have evolved to tolerate a degree of non-compliance, and humans are capable of picking out the information they are looking for.
There are opportunities to improve the markup of a web site so that information can be extracted from it reliably and programmatically, i.e. by screen scraping[1]. As a prerequisite to making the site accessible to machines, one should first improve its accessibility to humans (cf. the W3C accessibility standards[2]). The opportunities lie in adding semantic information to a web page such that it is readily accessible to both humans and machines, effectively making the web site a well-defined web service or API. This is one step towards the Semantic Web goal of the day when "computers become capable of analyzing all the data on the Web"[3]. HTML5[4] provides further steps towards improving the semantic quality of HTML markup and, although not yet supported by all browsers, offers interesting lessons to help you decide how best to semantically improve your HTML markup today.
By following the best practices discussed in this document, it's possible with bemokoLive to reliably use an existing PC web site to drive the content and business logic for a mobile web site. This can also be considered a low-risk first step towards a fully integrated web site that has the intelligence to deliver an effective user experience for all devices that might connect, mobile and PC.
Semantic HTML
Semantic HTML[5] refers to the creation of HTML with the use of appropriate elements (cf. <em> vs <i>) and the application of semantically named classes and ids.
Events
Consider the following non-semantically designed HTML:
<div class="blue">
  Meet John in London @ 12:30<br/>
  <i>Remember to bring the book he lent me</i>
</div>
This markup will render fine in a browser. A human reading it can quite easily pick out the relevant parts of the message, but if you sat down and tried to access the information programmatically you'd have to think carefully about how to extract it. Even if you created a way to extract the information today, can you be sure that the format of the message won't change in the future and break your extraction logic?
If, however, you wrote the information using the hCalendar Microformat standard[6], it becomes a lot easier to extract the information programmatically, and it remains just as readable to a human using an internet browser, e.g.:
<div class="vevent">
  <div>
    <span class="summary">Meet John</span> in
    <span class="location">London</span> @
    <abbr class="dtstart" title="2007-10-05T12:30:00">12:30</abbr>
  </div>
  <div class="description"><em>Remember to bring the book he lent me</em></div>
</div>
Adding classes in this manner also makes it easier to make stylistic changes, or to re-skin the site, at a later date through style sheet changes alone.
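The extraction that the hCalendar class names make possible can be sketched in a few lines of Python. This is a minimal sketch, not a complete hCalendar parser: it assumes the markup is well-formed (every tag closed) so that the standard-library XML parser can read it, and the helper names find_by_class and extract_event are hypothetical. Real-world pages usually need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# The hCalendar event from the example above, with every tag closed.
# Assumption: the input is well-formed enough for an XML parser.
MARKUP = """
<div class="vevent">
  <div>
    <span class="summary">Meet John</span> in
    <span class="location">London</span> @
    <abbr class="dtstart" title="2007-10-05T12:30:00">12:30</abbr>
  </div>
  <div class="description"><em>Remember to bring the book he lent me</em></div>
</div>
"""


def find_by_class(root, name):
    """Yield root and its descendants whose class attribute contains `name`."""
    for elem in root.iter():
        if name in elem.get("class", "").split():
            yield elem


def extract_event(markup):
    """Return the fields of the first vevent in the markup as a dict."""
    root = ET.fromstring(markup.strip())
    event = next(find_by_class(root, "vevent"))

    def text_of(name):
        elem = next(find_by_class(event, name), None)
        if elem is None:
            return None
        # Join the text of the element and all of its children.
        return "".join(elem.itertext()).strip()

    # The machine-readable start time lives in the title attribute of dtstart.
    start = next(find_by_class(event, "dtstart"), None)
    return {
        "summary": text_of("summary"),
        "location": text_of("location"),
        "start": start.get("title") if start is not None else None,
        "description": text_of("description"),
    }


print(extract_event(MARKUP))
```

Because the lookup is keyed on class names rather than on the position of the text, the surrounding layout can change without breaking the extraction, as long as the class names are kept.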
Forms
Consider another example, where we have a form for registration:
<form id="registration" method="post" action="/register">
  <div class="field">
    <div class="label">First name</div>
    <input class="given-name" name="given-name"/>
  </div>
  <div class="field">
    <div class="label">Surname</div>
    <input class="family-name" name="family-name"/>
  </div>
  <div class="field">
    <div class="label">Favourite Film</div>
    <input class="favourite-film" name="favourite-film"/>
  </div>
  <input class="do" type="submit" value="go"/>
  <input type="submit" value="register for newsletter"/>
</form>
- With the id attribute set to registration, it is easy to pick out the registration form from the page (there might be several forms on the page).
- For example, the form element can be accessed with the XPath //form[@id='registration'].
- The class="do" makes it easy to find the primary submit button that simply registers the user (as opposed to registering for the newsletter which might be considered as supplementary to the process).
- The class names of the input fields have been aligned with the hCard Microformat standard[7] so that even if the application had used different values in the name attribute, it'd still be easy to identify the appropriate input fields in the form.
- The mobile rendering of this registration form might be designed such that the favourite film input field is ignored, since it might not be considered essential for the mobile pages. Naming the fields consistently makes this easy to achieve.
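The points above can be sketched with Python's standard library. This is an illustrative sketch under stated assumptions, not how bemokoLive itself works: the page is assumed to be well-formed, the extra search form is invented to show that the id picks out the right form, and the function name summarise_registration and its skip_classes parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# A well-formed page holding two forms; the search form is a made-up extra.
PAGE = """
<body>
  <form id="search" method="get" action="/search">
    <input class="query" name="q"/>
  </form>
  <form id="registration" method="post" action="/register">
    <div class="field">
      <div class="label">First name</div>
      <input class="given-name" name="given-name"/>
    </div>
    <div class="field">
      <div class="label">Surname</div>
      <input class="family-name" name="family-name"/>
    </div>
    <div class="field">
      <div class="label">Favourite Film</div>
      <input class="favourite-film" name="favourite-film"/>
    </div>
    <input class="do" type="submit" value="go"/>
    <input type="submit" value="register for newsletter"/>
  </form>
</body>
"""


def has_class(elem, name):
    """True if the element's class attribute contains the token `name`."""
    return name in elem.get("class", "").split()


def summarise_registration(markup, skip_classes=("favourite-film",)):
    """Pick out the registration form, its mobile fields and primary button."""
    page = ET.fromstring(markup.strip())
    # Same expression as in the text, made relative for ElementTree.
    form = page.find(".//form[@id='registration']")
    inputs = list(form.iter("input"))
    # Keep named data fields, dropping those not essential on mobile.
    fields = [
        i.get("name")
        for i in inputs
        if i.get("name") and not any(has_class(i, c) for c in skip_classes)
    ]
    # The primary submit button is the one marked with class="do".
    primary = next(i for i in inputs if has_class(i, "do"))
    return {
        "action": form.get("action"),
        "fields": fields,
        "submit": primary.get("value"),
    }


print(summarise_registration(PAGE))
```

Matching on class tokens rather than on the name attribute means the sketch keeps working if the application renames its request parameters but leaves the hCard-aligned classes in place.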
Pages
When processing pages dynamically it is useful to be able to reliably determine what type of page we're looking at. For example, is it a login page, a results page or a news article page? It might even contain multiple types of content, as a portal page does.
There are several ways to do this, for example by setting an id on the body element, e.g.
<body id="login">
or by adding an id to an element which has the class set to type, e.g.
<div id="login" class="type">
This second technique may be easier to apply if your pages share a common skin (which includes the body element) or you have multiple types of content on the page. With this second technique you can easily get the type (or types) of content in the page with the XPath //*[@class="type"]/@id.
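As a rough sketch of the second technique, the following Python reads the content types from a portal-style page. The page content and the function name page_types are hypothetical, and the markup is assumed to be well-formed. Python's xml.etree.ElementTree supports only a subset of XPath and cannot select attributes with a trailing /@id step, so the sketch walks the elements and reads the id attribute in Python instead; it also matches type as one token of a multi-valued class attribute, which is slightly more lenient than the exact comparison in the XPath.

```python
import xml.etree.ElementTree as ET

# A hypothetical portal page carrying two types of content.
PORTAL = """
<body>
  <div id="login" class="type"><form id="login-form" action="/login"/></div>
  <div id="news" class="type article"><p>Headline</p></div>
  <div id="footer">Contact us</div>
</body>
"""


def page_types(markup):
    """Return the ids of all elements marked with the class token 'type'."""
    root = ET.fromstring(markup.strip())
    return [
        elem.get("id")
        for elem in root.iter()  # document order, including the root
        if "type" in elem.get("class", "").split() and elem.get("id")
    ]


print(page_types(PORTAL))  # ['login', 'news']
```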
How can bemoko Mobilise your Web Site?
We can use bemokoLive in various deployment modes - one of the easiest ways to get started is by using bemokoLive as a proxy. You can create a site that extracts the content from your PC website and delivers optimised mobile rendering through the site UI. If you're interested - get in touch with us and we'll show you how it works under the covers.
It Is Subjective
There are no hard and fast rules for creating semantic HTML. The first port of call should be the W3C accessibility guidelines[2], to get it right for humans. Microformats provide good specifications that you can align with to cover commonly used data structures. Finally, applying appropriate ids to the core sections of your page, such as a form or the main article or results of the page, and using well-defined classes, such as class="summary", for key content can go a long way towards making it easier to interact with the site programmatically.
References
- [1] Screen Scraping - http://en.wikipedia.org/wiki/Screen_scraping#Screen_scraping
- [2] HTML Techniques for Web Content Accessibility Guidelines 1.0 - http://www.w3.org/TR/WCAG10-HTML-TECHS/
- [3] Semantic Web - http://en.wikipedia.org/wiki/Semantic_Web
- [4] HTML5 specification - http://dev.w3.org/html5/spec/Overview.html
- [5] Semantic HTML - http://en.wikipedia.org/wiki/HTML#Semantic_HTML
- [6] hCalendar Microformat standard - http://microformats.org/wiki/hcalendar
- [7] hCard Microformat standard - http://microformats.org/wiki/hcard
