浙北英语网

 找回密码
 立即注册

QQ登录

只需一步,快速开始

湖州日报:3名学子入围“希望之星”只有狂人才干的事情!带字幕美国之音慢速英语VOA Special English学英语视频近1000个,自诩为国内最全!适合增加词汇和锻炼听力!点击查看...
查看: 312|回复: 0
打印 上一主题 下一主题

XML sitemap generator with command line interface for nginx on Linux

[复制链接]
跳转到指定楼层
1#
发表于 2013-8-3 16:32:02 | 只看该作者 回帖奖励 |倒序浏览 |阅读模式

            
    Did you trying Google'ing this? First result on the first page:
https://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
Edit:
As per the comments, I tried out the following sitemap generator:
http://sitemap-generators.googlecode.com/svn/trunk/docs/en/sitemap-generator.html
The downloaded zip bundle contains a few files:
drwxr-xr-x  19 user  group     646 Apr 10 05:22 .drwxr-xr-x   3 user  group     102 Apr 10 05:12 ..-r--r-----@  1 user  group      23 Jun 16  2005 AUTHORS-r--r-----@  1 user  group    1791 Jun 16  2005 COPYING-r--r--r--@  1 user  group    2267 Dec  5  2005 ChangeLog-rw-r--r--@  1 user  group     258 Dec  5  2005 PKG-INFO-r--r--r--@  1 user  group    1111 Dec  5  2005 READMEdrwxr-xr-x   3 user  group     102 Apr 10 05:16 build-r--r--r--@  1 user  group    5662 Sep  7  2005 example_config.xml-r--r-----@  1 user  group     996 Jun 16  2005 example_urllist.txt-r-xr-xr-x@  1 user  group     317 Dec  5  2005 setup.py-r-xr-xr-x@  1 user  group   73063 Dec  5  2005 sitemap_gen.py-r-xr-xr-x@  1 user  group   28551 Sep  7  2005 test_sitemap_gen.pyUsing the provided example_config.xml, I modified it in the following manner:
<?xml version="1.0" encoding="UTF-8"?><site  base_url="http://YOURDOMAIN.com/"  store_into="/var/www/sitemap_gen-1.4/sitemap.xml"  verbose="1"  >  <url  href="http://YOURDOMAIN.com/stats?q=name"  />  <url     href="http://YOURDOMAIN.com/stats?q=age"     lastmod="2004-11-14T01:00:00-07:00"     changefreq="yearly"     priority="0.3"  />  <urllist  path="urllist.txt"  encoding="UTF-8"  />  <!-- Exclude URLs that end with a '~'   (IE: emacs backup files)      -->  <filter  action="drop"  type="wildcard"  pattern="*~"           />  <!-- Exclude URLs within UNIX-style hidden files or directories       -->  <filter  action="drop"  type="regexp"    pattern="/\.[^/]*"     /></site>I think that serves as the template for generating the sitemap.xml.  Now, the generator supports pulling URLS from apache style access logs or pulling from a url list file. I opted to pull from a url list file, since I was testing from my laptop.
To generate the url list, I employed 'wget' to spider the site:
wget -mk --spider -r -l2 http://YOURDOMAIN.COM/or
wget -mk --spider -r -l2 http://YOURDOMAIN.COM/ -o urlinfolist.txt-r: recursive; -l2: depth (if not set, depth = unlimited). See wget manual page.
Then extracted the URLS from the wget-log that is generated:
cat wget-log | tr ' ' '\012' | grep "^http" | egrep -vi "[?]|[.]jpg$" | sort -u > urllist.txtor
cat urlinfolist.txt | tr ' ' '\012' | grep "^http" | egrep -vi "[?]|[.]jpg$" | sort -u > urllist.txtNote: Some of the exclusions I had in my line were not needed because the config file either already excluded them or could very easily exclude them.
Then, ran the generator:
python sitemap_gen.py --config=example_config.xml Which produced the sitemap.xml file.
The script looks to be designed to run in an automated fashion. But it worked for my test run. The wget can take a while to run. However, if you don't have any special rewrites/etc, you can just scan your site's static content path with a 'find' and maybe do some filtering on it before dumping it into the url list file.


分享到:  QQ好友和群QQ好友和群 QQ空间QQ空间 腾讯微博腾讯微博 腾讯朋友腾讯朋友
收藏收藏 转播转播 分享分享 分享淘帖
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

小黑屋|手机版|Archiver|典范英语|Oxford在线词典|在线查词典|每日签到|浙北英语网 ( 蜀ICP备10032068号 )

GMT+8, 2024-12-29 08:44 , Processed in 0.067080 second(s), 32 queries .

Powered by Discuz! X3.3

© 2001-2017 Comsenz Inc.

快速回复 返回顶部 返回列表