Work post

Feb. 27th, 2008 01:49 pm
[personal profile] pyesetz

Lorem ipsum dolor sit amet.  This is some pointless text to go with the animated elephant, which was drawn by Vincent Pontier, who is apparently friends with Olivier Plathey, who wrote the FPDF module for PHP.

When I started working at Company 𝔾 in June '06, I didn't know anything about MySQL, so "Mr. Bear" suggested that I use XML instead.  Being a n00b, I stupidly chose the DOMXML module, which is specific to PHP4 and has prevented us from migrating to PHP5.  Now PHP4 is at "end of life" and we really need to get rid of it.  Also the nightly validation report has started failing because even 48 MB of RAM is no longer enough to load in all the databases and check them for validity (because DOMXML has serious memory-leak issues).

This month I've been busting my tail, sniffing the grindstone, trying to get my billable hours up high enough to staunch the financial bleeding.  My RSS feed of LiveJournal posts now has 266 289 unread messages.  Last weekend I had gotten it down to below 100 but the posts just keep coming!  Oddly enough, there is no post from xolo about his Christmas celebrations.  And no reply to my email to loganberrybunny re the use of state/province to refer to the "home countries" of the UK; has offence been taken?

For the last three days I've been removing XML and replacing it with PHP arrays.  Instead of
<entry>
  <_key>deadbeef</_key>
  <name>Joe Schmoe</name>
  <country>USA</country>
  <region>Colorado</region>
  <city>Pike's Peak</city>
</entry>
it's now
$DB_Entries['deadbeef'] = array (
  'name' => 'Joe Schmoe',
  'country' => 'USA',
  'region' => 'Colorado',
  'city' => 'Pike\'s Peak',
);

Looks about the same, loads 10‒20 times faster!  Really, this is a serious demerit for XML.  The whole point of XML was that, since everyone would use it, it was worthwhile to optimize the Hell out of the parser for it, so then XML should be parsed faster than any other format.  But PHP's program-code parser is much faster than its XML parser.  In part I think this is because of the attributes.  XML tags have optional attributes, even though I don't use them, so the isomorphism between XML and PHP-arrays potentially could fail, although it doesn't in my case.
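To make the attribute point concrete, here is a minimal sketch of the XML-to-array mapping for an entry like the one above.  It is not the site's actual converter: it uses PHP5's SimpleXML rather than DOMXML, and the function name entry_to_array is made up for illustration.  A flat array has no slot for an attribute, so the sketch refuses any element that carries one instead of silently dropping it.

```php
<?php
// Hypothetical helper: map one <entry> element onto a (key, fields) pair.
// Assumes the entry holds only simple child elements, as in the example above.
function entry_to_array($xml)
{
    $entry = simplexml_load_string($xml);
    if ($entry === false) {
        throw new InvalidArgumentException('not well-formed XML');
    }
    $key = null;
    $fields = array();
    foreach ($entry->children() as $child) {
        // Any attribute breaks the one-to-one mapping, so reject it loudly.
        foreach ($child->attributes() as $attrName => $attrValue) {
            throw new InvalidArgumentException("attribute '$attrName' is not representable");
        }
        if ($child->getName() === '_key') {
            $key = (string) $child;
        } else {
            $fields[$child->getName()] = (string) $child;
        }
    }
    return array($key, $fields);
}

$xml = "<entry><_key>deadbeef</_key><name>Joe Schmoe</name><city>Pike's Peak</city></entry>";
list($key, $fields) = entry_to_array($xml);
$DB_Entries = array($key => $fields);
```

The resulting $DB_Entries has the same shape as the hand-written array, which is the isomorphism in question; it only holds because none of the tags carry attributes.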

Making this change required touching just about every actively-used file at the website!  All the databases needed conversion.  Any program that reads databases needed to start reading them the new way.  Any program that generates databases needed to start writing them the new way.  I've probably introduced dozens of bugs.  We'll see if I get any bug reports.  Most pages at the website are now served in only ⅓ the time they used to take; will anyone notice?
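As a rough sketch of what "writing them the new way" and "reading them the new way" could look like: var_export() emits valid PHP source (including the escaped quote in Pike's Peak), and include parses it back.  The names write_db and read_db are hypothetical, and this version has the file return the array rather than assign to $DB_Entries directly as in the example above.

```php
<?php
// Hypothetical writer: dump the whole array as a PHP source file.
function write_db($path, array $entries)
{
    $code = "<?php\nreturn " . var_export($entries, true) . ";\n";
    // Write to a temp file, then rename, so readers never see a half-written file.
    $tmp = $path . '.tmp';
    if (file_put_contents($tmp, $code) === false || !rename($tmp, $path)) {
        throw new RuntimeException("cannot write $path");
    }
}

// Hypothetical reader: PHP's own parser does all the work.
function read_db($path)
{
    return include $path;
}

$path = sys_get_temp_dir() . '/entries_' . getmypid() . '.php';
$original = array('deadbeef' => array('name' => 'Joe Schmoe', 'city' => "Pike's Peak"));
write_db($path, $original);
$DB_Entries = read_db($path);
unlink($path);
```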

And so, with this little side-problem taken care of, I can get back to the main project for this winter.  Unlike most Company 𝔾 projects, this one actually has a deadline because there's a conference in May that it has to be ready for.  I need to convert various files to PDF and then combine the PDFs on the fly, hence the need for FPDF, whose homepage has an elePHPant and a link to Vincent Pontier, hence this post.  Bye!

Date: 2008-02-27 10:42 pm (UTC)
From: [identity profile] giza.livejournal.com
My understanding of XML is that it was intended more as a data interchange format, and not something to store data in full time. With a syntax as complex as XML, that's a lot of parsing involved.

As it stands with your rewrite, there's still PHP code that is going to be parsed on every page load. Why not use serialize() (http://us.php.net/serialize) to store your data in a format which needs even less parsing? If there are concerns about human-readability of the data, you could always check the timestamp of the PHP source and only re-generate the serialized data if it is missing or older than the source.

I wrote a cache system a while back that stored large pieces of data (more than 100K in size) in files as serialized strings. PHP loaded and ran unserialize() on them amazingly fast, in less than a second. It was nice.
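For reference, the basic round trip is only a few lines. This is a minimal sketch, not the cache system mentioned above; the temp-file location is just a placeholder.

```php
<?php
// Sample data in the same shape as the $DB_Entries example from the post.
$data = array('deadbeef' => array('name' => 'Joe Schmoe', 'country' => 'USA'));

// Store the array as a serialized string in a file...
$cacheFile = tempnam(sys_get_temp_dir(), 'ser');
file_put_contents($cacheFile, serialize($data));

// ...and load it back with no PHP source or XML to parse.
$loaded = unserialize(file_get_contents($cacheFile));
unlink($cacheFile);
```

One caveat: unserialize() should only be fed files your own code wrote, since it will happily rebuild whatever the string describes.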

Alternatively, why not just put everything into a database? :-) Since you mention formerly not knowing anything about MySQL, I'm assuming that you have some knowledge of it now.

Edited Date: 2008-02-27 10:43 pm (UTC)

Date: 2008-02-28 01:20 am (UTC)
From: [identity profile] giza.livejournal.com
Maybe I didn't make it clear in my original comment, but the serialized files would not be intended to be edited by hand. You'd still edit the PHP files by hand, but they would only be read by your code on the first pass, at which point serialized copies of the data would be written and then read on future passes. The code would look something like this:

if (!cache_file()) {
   $data = read_php_file();
   write_cache_file($data);
} else {
   $data = read_cache_file();
}


In this example, cache files == the serialized data. That is essentially the same algorithm that I implemented on a project a while back so I could avoid some particularly expensive database queries.
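A slightly fuller sketch of the same idea, with the timestamp check added so that a hand-edited PHP file invalidates its cache. The function name load_with_cache and the file layout are assumptions, and the source file is assumed to return its array.

```php
<?php
// Hypothetical loader: $sourceFile is a hand-edited PHP file that returns an array.
function load_with_cache($sourceFile, $cacheFile)
{
    clearstatcache();
    // The cache counts as fresh if it exists and is at least as new as the source.
    $fresh = is_file($cacheFile) && filemtime($cacheFile) >= filemtime($sourceFile);
    if ($fresh) {
        $data = unserialize(file_get_contents($cacheFile));
        if ($data !== false) {
            return $data;
        }
        // A corrupt cache falls through and is rebuilt below.
    }
    $data = include $sourceFile;
    file_put_contents($cacheFile, serialize($data), LOCK_EX);
    return $data;
}

// Demo with throwaway files in the temp directory.
$source = sys_get_temp_dir() . '/db_' . getmypid() . '.php';
$cache = $source . '.ser';
file_put_contents($source, "<?php\nreturn array('deadbeef' => array('name' => 'Joe Schmoe'));\n");
$first = load_with_cache($source, $cache);   // parses the PHP source, writes the cache
$second = load_with_cache($source, $cache);  // reads the serialized copy
```

File timestamps have one-second resolution, so an edit made in the same second the cache was written would be missed; comparing with > instead of >= trades that for an occasional extra rebuild.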

> It insists on putting all its databases in a secret place (typically
> /var/lib/mysql) instead of storing them in the directories where they'll
> be used—this makes it a bit more involved to archive an entire project.

That's perfectly normal for any database. You never want to archive the database files directly anyway since those files are subject to modification while the database server is still running--you'd have to shut down the database server to perform a backup. (Side note: Oracle actually let you put tablespaces in "archive mode", during which changes would be written to a separate log instead. But that was insane.)

What you want to do for archival is write a short script that calls the mysqldump program and dumps the entire database to a secure location.
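Such a script could be as small as the sketch below. The database name and backup path are placeholders, and credentials are assumed to come from an option file such as ~/.my.cnf so that no password appears on the command line. The --single-transaction flag only gives a consistent snapshot for InnoDB tables.

```php
<?php
// Hypothetical helper: build the mysqldump command line for one database.
function mysqldump_command($database, $outFile)
{
    return 'mysqldump --single-transaction'
        . ' --result-file=' . escapeshellarg($outFile)
        . ' ' . escapeshellarg($database);
}

// Placeholder database name and destination; adjust to the real setup.
$cmd = mysqldump_command('company_g', '/secure/backups/company_g-' . date('Y-m-d') . '.sql');

// To actually run it:
//   exec($cmd, $output, $status);
// A non-zero $status means the dump failed.
```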

> The program is a behemoth. It's reasonably fast once it gets going, but
> the first query after an idle period takes several seconds (on a shared
> server) while all that code gets swapped in.

From what you describe, it sounds like the server is running out of RAM. That could be either physical RAM, or memory allocated to MySQL. A properly tuned database server will have plenty of RAM so most if not all of the database can be loaded into RAM, thus avoiding the problem you described. To clarify: this is not MySQL's fault, it is a database administration issue.

> Mr. Bear's preference is for a "punctuation optional" search algorithm
> that doesn't match standard SQL operators, so everything would need to be
> stored twice ("search form" and "display form"). It's easy enough to
> write a custom search-engine in C for readable data like XML or PHP
> arrays, not so easy to try to read MySQL data files directly!

I'm not sure I understand this last part. You should never ever be reading MySQL database files directly; that's what SQL is for.
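For what it's worth, the "search form" in the quoted passage might be nothing more than a normalized copy of each value. Here is a minimal sketch, assuming "punctuation optional" means lower-casing and dropping anything that is not an ASCII letter, digit or space; the real rules aren't given here, and both function names are hypothetical.

```php
<?php
// Hypothetical normalizer: lower-case, drop punctuation, collapse whitespace.
function search_form($text)
{
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9\s]+/', '', $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Return the keys of all entries whose $field contains the query,
// comparing search forms on both sides.
function search_entries(array $entries, $field, $query)
{
    $needle = search_form($query);
    $hits = array();
    foreach ($entries as $key => $entry) {
        if ($needle !== '' && strpos(search_form($entry[$field]), $needle) !== false) {
            $hits[] = $key;
        }
    }
    return $hits;
}

$DB_Entries = array('deadbeef' => array('name' => 'Joe Schmoe', 'city' => "Pike's Peak"));
$hits = search_entries($DB_Entries, 'city', 'pikes peak');
```

The same normalized string could just as well live in a second column of a table and be matched with an ordinary SQL comparison, which is the "stored twice" arrangement the quote describes.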

I think the best advice I can offer here would be to get your hands on a book on SQL and start studying up on it. Understanding table joins and how relational databases work is an essential skill for web development, and it will help you build larger sites.
