Did you try Googling this? First result on the first page: https://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

Edit: As per the comments, I tried out the following sitemap generator: http://sitemap-generators.googlecode.com/svn/trunk/docs/en/sitemap-generator.html

The downloaded zip bundle contains a few files:

    drwxr-xr-x  19 user group   646 Apr 10 05:22 .
    drwxr-xr-x   3 user group   102 Apr 10 05:12 ..
    -r--r-----@  1 user group    23 Jun 16  2005 AUTHORS
    -r--r-----@  1 user group  1791 Jun 16  2005 COPYING
    -r--r--r--@  1 user group  2267 Dec  5  2005 ChangeLog
    -rw-r--r--@  1 user group   258 Dec  5  2005 PKG-INFO
    -r--r--r--@  1 user group  1111 Dec  5  2005 README
    drwxr-xr-x   3 user group   102 Apr 10 05:16 build
    -r--r--r--@  1 user group  5662 Sep  7  2005 example_config.xml
    -r--r-----@  1 user group   996 Jun 16  2005 example_urllist.txt
    -r-xr-xr-x@  1 user group   317 Dec  5  2005 setup.py
    -r-xr-xr-x@  1 user group 73063 Dec  5  2005 sitemap_gen.py
    -r-xr-xr-x@  1 user group 28551 Sep  7  2005 test_sitemap_gen.py

Using the provided example_config.xml, I modified it in the following manner:

    <?xml version="1.0" encoding="UTF-8"?>
    <site base_url="http://YOURDOMAIN.com/"
          store_into="/var/www/sitemap_gen-1.4/sitemap.xml"
          verbose="1">

      <url href="http://YOURDOMAIN.com/stats?q=name" />
      <url href="http://YOURDOMAIN.com/stats?q=age"
           lastmod="2004-11-14T01:00:00-07:00"
           changefreq="yearly"
           priority="0.3" />

      <urllist path="urllist.txt" encoding="UTF-8" />

      <!-- Exclude URLs that end with a '~' (IE: emacs backup files) -->
      <filter action="drop" type="wildcard" pattern="*~" />

      <!-- Exclude URLs within UNIX-style hidden files or directories -->
      <filter action="drop" type="regexp" pattern="/\.[^/]*" />
    </site>

That serves as the template for generating sitemap.xml. The generator supports pulling URLs either from Apache-style access logs or from a URL list file. I opted to pull from a URL list file, since I was testing from my laptop.

To generate the URL list, I used 'wget' to spider the site:

    wget -mk --spider -r -l2 http://YOURDOMAIN.COM/

or

    wget -mk --spider -r -l2 http://YOURDOMAIN.COM/ -o urlinfolist.txt

(-r: recursive; -l2: depth 2; if not set, the depth is unlimited. See the wget manual page.)

Then I extracted the URLs from the wget-log that is generated:

    cat wget-log | tr ' ' '\012' | grep "^http" | egrep -vi "[?]|[.]jpg$" | sort -u > urllist.txt

or

    cat urlinfolist.txt | tr ' ' '\012' | grep "^http" | egrep -vi "[?]|[.]jpg$" | sort -u > urllist.txt

Note: some of the exclusions in my pipeline were not strictly needed, because the config file either already excluded them or could very easily be made to.

Then I ran the generator:

    python sitemap_gen.py --config=example_config.xml

which produced the sitemap.xml file.

The script looks to be designed to run in an automated fashion, but it worked fine for my one-off test run. The wget spider can take a while to run. However, if you don't have any special rewrites etc., you can just scan your site's static content path with 'find', do some filtering on the results, and dump them into the URL list file (a sketch of that follows below).
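Here's a rough sketch of that 'find' approach. The docroot path and domain are assumptions on my part (a static site served straight out of /var/www/html, mapped one-to-one onto the site root with no rewrites), so adjust both to your setup:

    # Sketch only: build urllist.txt from static files instead of spidering.
    # ASSUMPTION: /var/www/html maps directly onto http://YOURDOMAIN.com/.
    find /var/www/html -type f \( -name '*.html' -o -name '*.htm' \) \
      | grep -v '/\.' \
      | sed 's|^/var/www/html|http://YOURDOMAIN.com|' \
      > urllist.txt

The grep -v '/\.' just mirrors the hidden-file filter already in the config, and no sort -u is needed since find won't emit duplicate paths.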
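However you build it, the urllist.txt the generator consumes is plain text with one URL per line. Going by the bundled example_urllist.txt (from memory, so double-check against that file), you can optionally append lastmod/changefreq/priority attributes after the URL on the same line:

    http://YOURDOMAIN.com/
    http://YOURDOMAIN.com/stats?q=name lastmod=2004-11-14T01:00:00-07:00
    http://YOURDOMAIN.com/stats?q=age changefreq=yearly priority=0.3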
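And since the script seems built for unattended runs, something along the lines of this crontab entry would regenerate the sitemap nightly (the paths are mine, not from the bundle; point them at wherever you unpacked it):

    # Regenerate sitemap.xml every night at 03:00 (paths assumed).
    0 3 * * * cd /var/www/sitemap_gen-1.4 && python sitemap_gen.py --config=example_config.xml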