At the DSpace user group meeting in Gothenburg, Rob Tansley (Google) gave a talk about search engines and their application to DSpace.
Background on Search Engines.
Search engines crawl the web by following links and index the pages they find.
A search engine needs to do three things:
1. Discover the site (DSpace instance)
2. Index all the pages (items and bitstreams)
3. Retrieve enough information to judge relevance and display effective snippets.
Google Scholar differs only in step 3.
The first thing to do is:
Register for webmaster tools on each search engine
- Bing http://www.bing.com/webmaster
- Google https://www.google.com/webmasters/tools
- Yahoo https://siteexplorer.search.yahoo.com/mysites
Usually you have to place a file on the DSpace site to prove you own the domain. This is different for DSpace than for simple web sites, and requires updating the install after dropping the file into the correct place:
- JSPUI: drop it in the webapp directory (alongside robots.txt) and update the install
- XMLUI: drop it in webapp/static, add a line to sitemap.xmap, and update the install
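For the XMLUI case, the sitemap.xmap addition is a Cocoon matcher; a sketch along these lines (the verification filename is made up, use the one the search engine gives you, and check placement against your own sitemap.xmap):

```xml
<!-- hypothetical filename: serve the verification file from the static dir -->
<map:match pattern="google1234abcd5678.html">
  <map:read src="static/google1234abcd5678.html"/>
</map:match>
```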
Sign up for Google Analytics: free, useful statistics. Supported in DSpace 1.5.2:
- XMLUI: simple config parameter
- JSPUI: Add analytics code to your footer-default.jsp
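For JSPUI, the code pasted into footer-default.jsp is the standard Google Analytics tracking snippet of the time, along these lines (the UA- account ID is a placeholder for your own property):

```html
<!-- classic asynchronous Google Analytics snippet; replace UA-XXXXXXX-X -->
<script type="text/javascript">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-XXXXXXX-X']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script');
    ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol
              ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(ga, s);
  })();
</script>
```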
Upgrade! Later versions of DSpace have significant improvements for indexing by search engines.
Site discovery issue: DSpace may have multiple URLs.
- Choose one preferred access URL, including choosing between http and https
- Ensure other URLs respond with a 301 Moved Permanently redirect
- Ensure Handles redirect to the preferred URL
This can be done with Apache/Tomcat config
- (login can still be https; it is less important to have login pages on the single URL)
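The redirect half of this can be sketched in Apache config roughly as follows (hostnames are hypothetical; mod_alias must be enabled):

```apache
# Redirect every alternate hostname to the one preferred access URL
<VirtualHost *:80>
    ServerName dspace-old.example.edu
    ServerAlias repository.example.edu
    # mod_alias: issues a 301 Moved Permanently
    Redirect permanent / http://dspace.example.edu/
</VirtualHost>
```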
It is easier to manage if DSpace has its own domain, with a specific robots.txt and configuration. This doesn't impact discoverability, but it does let you set up things like a Google Custom Search Engine for the single domain.
How do you verify it's being discovered?
- Search for site:url (site:dspace.mit.edu)
- This works on Google, Bing, and Yahoo
- Google Scholar does not support this; it's best to search for a complete title.
In terms of indexing items, engines use standard link following web crawlers.
They don't use OAI-PMH for some key reasons:
- usually minimal metadata
- usually no access to full text
- no predictable relationship between an OAI URL and the DSpace URL
- often no link to the item itself
- only a very small minority of sites use OAI-PMH
The most important thing is robots.txt
If in doubt, don't block! DSpace 1.5 and 1.5.1 ship with a bad robots.txt file.
Look for this line and remove it:
- Disallow: /browse
Note: robots.txt has to be at the top level of a domain, e.g. http://dspace.foo.edu/robots.txt
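As a sketch, the problem file looks something like the following; the Disallow line is the one to delete, since it keeps crawlers away from the browse pages they need in order to discover items (exact shipped contents may vary by version):

```
User-agent: *
# Remove this line - it blocks the browse pages crawlers use to find items:
Disallow: /browse
```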
A good way to check your site is with a text-only browser: view your DSpace site with Lynx from outside your network. This helps confirm that a search engine can effectively navigate the site and reach bitstreams.
As of DSpace 1.5, sitemaps are supported: pages presented purely for search engine consumption, a "browse UI" for web crawlers. A sitemap is a static file and makes it easy to find new content, so it is a very cheap and good way to keep server load down. DSpace supports both types of sitemap: a simple HTML sitemap and the sitemaps.org protocol.
The HTML sitemap version works for all search engines. The map is usually generated by a cron job once a day. Your front page must link to the htmlmap page, and you can also submit the HTML sitemap via webmaster tools (this is optional). To verify: search for site:url sitemap.
The sitemaps.org protocol is an XML-based format. Instead of an HTML link, you add a Sitemap: line to your robots.txt file, then submit the sitemap to the search engines. Add each engine's update URL to dspace.cfg to prompt search engines to re-crawl. Check your Apache logs to see whether the sitemap URL has been read. (Thought: maybe it would be nice to have this logged in the DSpace admin area.)
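A typical setup, sketched below with a 1.5-era script name and config key; treat the ping URL as an example and verify it against the engine's current documentation:

```
# crontab: regenerate sitemaps nightly
0 4 * * *  [dspace]/bin/generate-sitemaps

# dspace.cfg: engine URLs to ping after each regeneration
sitemap.engineurls = http://www.google.com/webmasters/sitemaps/ping?sitemap=
```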
Returning useful data and your ranking.
Search engines need access to the full text; metadata is not as useful. The engine uses the full text to judge relevance and ranking and to create useful result snippets, and Google Scholar uses it for citation analysis. Full text is much more important than metadata for search engines.
For restricted content, consider allowing search engine IPs to access the items, using a search-engine group with IP authentication, and add <meta name="robots" content="noarchive"> so the engines don't serve a cached copy of the restricted full text.
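With DSpace's IP authentication plugin, the mapping from crawler address ranges to a group looks roughly like this in dspace.cfg. The group name and address ranges here are invented; real crawler ranges change over time and should be checked before use:

```
# Stack IPAuthentication in front of the usual password authentication
plugin.sequence.org.dspace.authenticate.AuthenticationMethod = \
    org.dspace.authenticate.IPAuthentication, \
    org.dspace.authenticate.PasswordAuthentication

# Requests from these ranges are treated as members of group SEARCH_ENGINES
authentication.ip.SEARCH_ENGINES = 66.249.64.0/19, 65.55.0.0/16
```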
Content of Bitstreams
Make sure search engines can read and interpret contents of bitstreams.
- For documents, ensure PDFs contain text and not just images
- The fewer files the better: avoid one file per chapter, which makes citation analysis harder and splits ranking across the chapters
- Word is OK; plain text is best
Ensure the abstract is descriptive for non-document items; this is far more valuable than other metadata fields.
Often users who hit your site by landing on the full text directly won't be able to navigate to your DSpace instance; they just see the file contents. There is no easy way to stop this. (Thought: perhaps look at a Moodle file.php-style approach to delivery, to keep the surrounding framework.)
Handles don't fit into search engines' approach to the web. Links to Handles may not improve ranking as much as links to the item's splash page. There is no real way around this.
Metadata in HTML Headers
DSpace 1.5+ supports including the metadata in the HTML <head> of each item page, along with links to the full text. This lets search engines (especially Scholar) parse the metadata despite layout changes. If you have customised the metadata registry, ensure you update the mappings in configs/xhtml-head-item.properties, and make sure any UI customisations don't leave these headers out.
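The tags this produces in each item page's <head> look roughly like the sketch below (the values are invented, and the left-hand names depend on your xhtml-head-item.properties mappings):

```html
<meta name="DC.title" content="A Hypothetical Thesis Title" />
<meta name="DC.creator" content="Doe, Jane" />
<meta name="DC.date" content="2009-05-01" />
<meta name="DCTERMS.abstract" content="One-paragraph descriptive abstract." />
```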
Lightening Server Load
DSpace 1.4 and later support the If-Modified-Since header, so crawlers do not have to retrieve unchanged content.
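In plain HTTP the exchange looks like this (path and date invented); the 304 response carries no body, so the crawler saves the whole transfer:

```
GET /bitstream/123456789/42/1/thesis.pdf HTTP/1.1
Host: dspace.example.edu
If-Modified-Since: Sat, 01 Nov 2008 12:00:00 GMT

HTTP/1.1 304 Not Modified
```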
Use sitemaps, which prevent crawlers from hitting all the browse pages.
Careful crafting of robots.txt
- Block /browse in robots.txt once sitemaps are in place
- Check back one week and one month later to ensure updated items have been indexed
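Putting the pieces together: once sitemaps are generated and submitted, a robots.txt along these lines (hostname hypothetical) points crawlers at the sitemap while keeping them off the expensive dynamic browse pages:

```
User-agent: *
Disallow: /browse
Sitemap: http://dspace.example.edu/sitemap
```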