- 2007-05-16
-
13:30 /people/disk2 is online again.
13:15 /people/disk2 is down.
- 2007-05-09
-
22:00 /people/disk2 is online again.
18:00 /people/disk2 is down.
- 2007-03-01
-
/people/disk1 is now available again.
- 2007-02-26
-
The cluster has been restarted, and is 'open for business'. We do have
a problem with /people/disk1, with the result that kvljg affiliated
users at present cannot run jobs.
- 2007-02-24
-
At about 6:09 this morning the cluster was hit by a power failure. As soon as
weather (and road conditions) permit, the cluster will be restarted.
- 2006-08-30
-
A new FAQ web-page is now available with a couple of recommendations on
how to use the new multi-core nodes more effectively. Please visit:
Jobs on multi-core machines.
Also visit the new Nodeload
web-page where jobs can be inspected for efficient use of the nodes.
- 2006-08-21
-
The Horseshoe is now open for business. Please use fe4.dcsc.sdu.dk
as the new frontend to compile and manage your jobs. fe4 is binary
compatible with the compute nodes. fe1 and fe2 will be turned off.
Since the cluster more or less is starting from scratch (we are now
using OpenSuSE 10.1 as the OS, the version of TORQUE/MAUI for queue
management is also a bit different) please be patient and report
any problems you may find when using the system. The queue
commands (qsub, qdel, and qstat) should behave as we are used to.
Please observe that your application will benefit from a recompilation
since the CPUs on the compute nodes are Intel Woodcrest, in the
EM64T family of Intel CPUs. We'll send out more information about this
shortly.
- 2006-08-19
-
The RAID system /people/disk2 has been repaired - it's now
possible to log on to the old frontend machines (fe1 and fe2) - and
to the new frontend fe4.
The new frontend (fe4) has the newest versions of the Intel,
Portland Group, and GNU compilers to create binaries compatible with
the x86_64 architecture of our new compute nodes.
- 2006-08-17
-
We expect to open the Horseshoe for normal business on Monday 8/21
at 10.00AM.
We have some outstanding issues with the fileserver
called disk2 which we expect to be resolved before Monday.
As soon as /people/disk2 is available again we'll post
a notice on this web-site.
The Horseshoe will, until Monday morning, be in use for producing the
best possible Linpack score for a listing on
top500.
- 2006-08-09
-
Today we finished mounting the new servers in the racks and reconnecting
the network. We still have to adjust BIOS settings and install an operating
system on the 200 servers. When this has been accomplished we'll attempt
to get Horseshoe listed on Top500, before we are back in normal operation.
- 2006-08-06
-
We hope to take delivery of 200 servers from Dell Monday 8/7.
The racks and electric installations are almost in place. In a couple of
days we hope to announce a date for normal resumption of service on the
Horseshoe cluster.
- 2006-07-26
-
Horseshoe remodeling:
giga2 will close on 31/7/06 8AM. A computer broker will come to
collect the PCs.
Horseshoe will reopen in the middle of August with 200
quad-core (Woodcrest) servers supplied by Dell.
Until the reopening the fileservers should be on-line most of
the time.
- 2006-06-15
-
15:30
The Horseshoe cluster should be ready for business again.
- 2006-06-15
-
Yesterday the SDU campus suffered a total loss of electricity;
consequently all nodes and servers stopped. Since we still have
an unstable power supply situation, the nodes, disk4, disk5, disk6,
and disk7 will remain powered down for the time being.
- 2006-06-01
-
On Thursday 6/1/06 at 8AM the nodes constituting the workq and
giga queues will be turned off. After that time only the SDU and KVL
based groups will have access to the Horseshoe computing resources. Users
belonging to other groups can still log in and retrieve files.
- 2006-05-22
-
10:30PM. The testing (and repair) of /people/disk2 terminated
at about 8:50PM. After restarting the fileserver it became clear that the
nodes had invalid (NFS) mountpoints, thus necessitating a reboot of all
nodes.
The cluster should be ready for use again.
- 2006-05-22
-
The filesystem on /people/disk2 has developed a problem requiring
a thorough integrity check. This should be completed before 8PM
local Danish time.
- 2006-05-11
-
Update
A leak has been found in the compressor's coolant loop - we don't
have an ETA for the repairs - since tomorrow is a holiday in
Denmark, we do not expect any news until Monday next week.
The cooling compressor has lost its supply of coolant. We expect
the compressor to be back in operation at about 2PM.
- 2006-05-10
-
At 7PM we experienced a power failure in the room housing the giga2
machines and the fileservers disk4,..,disk7. Most of the systems are
affected. Facilities management will be notified tomorrow morning.
- 2006-03-13
-
disk2 had NFS problems. Restarted.
- 2006-01-26
-
For reasons which cannot be determined disk2 dropped the filesystem
for /people/disk2 tonight.
The filesystem has been repaired and is online again.
- 2006-01-19 - 23:00
-
The fileserver disk2 has been unstable since about 6PM this evening,
probably due to the dropout of one of the raidtowers supporting
/people/disk2. The problem will be attended to ASAP.
- 2005-10-17
-
The migration to TORQUE has (more or less) been completed. We still have
some nodes which for (hardware-) technical reasons have to be "debugged".
- 2005-10-13
-
TORQUE migration update:
As of noon we have completed the migration in queues giga and giga2. We have
migrated 338 nodes in workq, and are still waiting for 59 old PBSPro jobs to
terminate.
- 2005-10-12
-
TORQUE migration update:
As of noon we have migrated 171 nodes in workq, 136 nodes in giga, and
260 in giga2 to TORQUE management. Thus all functional nodes in giga
have been dealt with. We still have some nodes in giga2, which for
(hardware-)technical reasons cannot be migrated.
- 2005-10-11
-
As of noon we have migrated 43 nodes in workq, 64 nodes in giga, and 97
nodes in giga2 to TORQUE management. Nodes will be added as the old PBSPro
jobs terminate.
Please remember that standard output and standard error (the .o and .e
files) will be delivered when the old job terminates, but an e-mail will
not be sent.
Hence you should check
http://www.dcsc.sdu.dk/docs/load/PBSPRO/expanded_queue_info.php
once in a while to remember: a) which jobs you had running and b) which
jobs you had queued as of yesterday @1PM.
Job scripts (and job control files) for old jobs can be reclaimed here:
Scripts
- 2005-10-03
-
Migration to TORQUE queuing system on 10/10/05
The license for the PBSPro queuing system used on The Horseshoe is about to
expire. Until now we have been granted a free academic license by Altair
Engineering. However, Altair has changed its policy and is now asking us
for a large fee to maintain the license.
This is not feasible for us, so on Monday 10/10 @ 1PM we'll migrate to the
open-source TORQUE queuing system in order to maintain a similar user
interface. In that respect we are following in the footsteps of many large
academic computing centers formerly using PBSPro. In fact the TORQUE
versions of the 'qsub', 'qstat', and 'qdel' commands are (almost) identical to
their PBSPro counterparts.
The plan for the migration is as follows:
- Running jobs at 1PM Monday 10/10 will be allowed to finish.
However, there will be no e-mails from the system upon job
termination, if this was requested at time of job-submission.
- Jobs in the queued state at 1PM Monday 10/10 CANNOT BE MIGRATED to
TORQUE, since the formats of the job-control files are too
different. This unfortunately implies that queued jobs have to be
resubmitted.
- When PBSPro jobs terminate the nodes allocated to them will be placed
under TORQUE control, which implies that at about 3PM Wednesday 12/10
all nodes in 'giga' and 'giga2' will have been migrated to TORQUE, and
at the latest Tuesday 18/10 the last of the 'workq' nodes will be under
TORQUE control.
- Our scheduler, MAUI, will migrate seamlessly to use TORQUE such that
the fairshare information at the time of changeover will be preserved.
To help figure out which jobs are running or queued at the time PBSPro is
"turned off" the "Expanded queue information" webpage
(expanded_queue_info.php) will be saved here:
http://www.dcsc.sdu.dk/docs/load/PBSPRO/expanded_queue_info.php
For each queued job we'll also copy the submitted jobscript to
http://www.dcsc.sdu.dk/docs/load/PBSPRO/
where they will be listed as job-id.infra.SC. This will hopefully
be of help to you when resubmitting jobs or submitting new jobs.
We realize this is a fairly short notice, and that the migration is not
entirely smooth, due to the fact that we cannot migrate queued PBSPro jobs
to TORQUE. However, we hope that the migration to TORQUE can happen without
too many problems.
- 2005-09-12
-
The upgrade of the raidtowers has been concluded successfully,
albeit our timetable didn't hold (we finished about 3 hours later
than planned). We hope that the new raidtowers will prove more
stable, and that the additional diskspace will find its use.
- 2005-08-31
-
Upgrade on 12/9/05 - deployment of new
raidtowers
We have purchased new raid-towers to replace the 2 AT1600 systems which
have caused us quite a bit of headache, and one X3i system, as its
3-year warranty period has expired.
In order to perform the upgrade we need to bring all running jobs to a halt,
to ensure consistent filesystems while the raidtowers are replaced. The
bad news is that we need about 3 hours to complete the installation, the
good news is that we can enlarge the following partitions:
/people/disk2 from 1,92TB to about 2,33TB
/people/disk1 from 1,50TB to about 2,33TB
/wkspace2 from 1,92TB to about 3,26TB
/wkspace1 will not receive an upgrade this time around as we'll
continue to use the old X3i tower (and use the just replaced X3i as a
spare parts repository, until a replacement can be funded - hopefully
soon).
The plan is thus to shut down the cluster from 9am until approximately
noon on 12/9/05. HOWEVER, ALL RUNNING JOBS ON 12/9/05 AT 9AM WILL BE
FLUSHED OUT OF THE QUEUES. The queues will remain open to receive newly
submitted jobs - but if you think that a queued job should NOT start
until after the upgrade of the filesystems it can be put on hold with
the command:
rsh infra /usr/pbs/bin/qhold -h u job-id
and later released with:
rsh infra /usr/pbs/bin/qrls -h u job-id
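For users with several queued jobs, the two commands above can be generated once per job-id. The following is a minimal Python sketch that only builds the command lines and does not run them; the helper name hold_release_commands and the job-ids shown are hypothetical and not part of the cluster setup.

```python
# Sketch: build the hold/release command lines quoted above for several jobs.
# The helper name and the example job-ids are illustrative assumptions.

QHOLD = "rsh infra /usr/pbs/bin/qhold -h u {job_id}"
QRLS = "rsh infra /usr/pbs/bin/qrls -h u {job_id}"


def hold_release_commands(job_ids):
    """Return (hold, release) lists of command strings, one pair per job-id."""
    hold = [QHOLD.format(job_id=j) for j in job_ids]
    release = [QRLS.format(job_id=j) for j in job_ids]
    return hold, release


if __name__ == "__main__":
    hold, release = hold_release_commands(["12345", "12346"])
    print("\n".join(hold))     # to be issued before the shutdown
    print("\n".join(release))  # to be issued after the upgrade
```

The printed lines can be pasted into a shell on a frontend; keeping the list of held job-ids makes it easy to release exactly the same jobs afterwards.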
- 2005-07-21
-
The giga2 is online again.
The situation is not resolved but will be further investigated when
personnel return from summer vacation.
- 2005-07-20
-
Again we lost a phase in one of our 63A feeds for the
giga2 machines.
About 33% of the machines were without power so long that the
power supply had to give up.
The electrics guys are informed.
The giga2 queue is suspended until this is investigated.
When that will be is unknown at this late hour (16:45).
- 2005-07-12
-
To correct the "missing T-phase" issue all machines in giga2
will have to be shut off.
This will happen (still unconfirmed)
Wednesday July 13 shortly
after noon.
The giga2 queue is currently suspended and the currently
running jobs will have sufficient time to terminate normally.
giga2 will be online again Wednesday afternoon.
- 2005-07-11
-
It turns out that the "blown fuses" below is a missing T-phase on
14 out of 20 groups.
The repair necessitates a full shutdown of all electricity in the
room covering giga2 including the servers, disk4-7,
positioned in the same room.
The necessary spare-parts are not at hand but we have been informed
that the electricians will be able to perform the repair Wednesday
afternoon. This still has to be confirmed though.
Users will be notified by email.
- 2005-07-10
-
Between 13:43:12 and 13:48:29 we had no power in the room
holding the giga2 machines.
Not all machines are up due to blown fuses. This will be
corrected as soon as possible.
- 2005-05-25
-
Unfortunately the RAID-tower housing /wkspace1 has suffered the
loss of 2 disks simultaneously - since the partition is a RAID-5 device,
this implies the total loss of data.
We are in contact with the vendor to have the faulty disks replaced
as soon as possible.
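Why two simultaneous disk failures are fatal for a RAID-5 partition can be illustrated with XOR parity. The following is a toy Python sketch of the principle only, not of the RAID tower's actual firmware; the block contents and function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single missing block (None) of a stripe, parity included.

    Raises ValueError unless exactly one block is missing: the single parity
    equation per stripe cannot determine two unknown blocks.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks missing")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]    # hypothetical data blocks
    stripe = data + [xor_blocks(data)]     # last block holds the parity

    one_lost = [stripe[0], None, stripe[2], stripe[3]]
    assert rebuild_missing(one_lost)[1] == b"BBBB"   # one loss is recoverable

    two_lost = [None, None, stripe[2], stripe[3]]
    try:
        rebuild_missing(two_lost)
    except ValueError as err:
        print(err)                                   # two losses are not
```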
- 2005-05-18 8:05am
-
The RAID-tower housing /wkspace1 has crashed, due to a faulty
disk during reconstruction of the RAID5 partition (brought on by another
faulty disk). We'll initiate repairs as soon as possible.
- 2005-03-09 11:05pm
-
The filesystem check of /people/disk2 has finally terminated.
It appears that no files have been lost. The filesystem is thus again
available for use.
- 2005-03-09 09:55am
-
Once again we lost one of the Fibrenetix RAID boxes due to a simple
disk failure.
In order to stabilize the SCSI bus we have restarted disk2
but unfortunately the boot process has seen some errors in the
/people/disk2 filesystem and is therefore currently running
the check program. This will take some hours.
- 2005-03-01
-
Due to an incident where infra became unstable we had to perform a hard
reboot. Unfortunately we lost all running jobs in the process.
- 2005-01-17 Continuation of the cluster.
-
DCSC has approved our application for extending the
lifetime of the initial pool of 512 machines to 1/6-2006.
It should be noted, however, that the warranty expires
1/8-2005, and no hardware replacements will be done
after that date. Machines that die after that date will
be cannibalized to keep as many machines running as possible.
- 2004-11-18 Odense went black for 20 minutes
-
but we only had a power surge between 7:21:17 and 7:21:28.
Nevertheless we lost all nodes. All servers stayed online due to UPSes.
- 2004-11-11
-
11PM: /people/disk2 is back on-line. Currently the RAID-1
stabilization of /people/disk2 is being regenerated, which costs some
performance.
- 2004-11-11
-
The raidsystem housing /people/disk2 has lost a hard-disk, this
triggered a SCSI error on the server (disk2), such that the partition
is unavailable. The raidsystem is currently (at 5PM) rebuilding the
raid device - we have to wait for this to complete before we can verify
the filesystem, and release /people/disk2 for use. We expect this to
finish at the latest 11AM 11/12/2004.
- 2004-10-28
-
The Portland Group Compiler has been upgraded to version 5.2.
- 2004-10-14
-
It has been arranged with SDU physical facilities that the faulty
fuse will be changed Friday October 15 at 2PM.
- 2004-10-13
-
Update on the power failure situation.
To repair the broken main fuse we have to cut power to all giga2
nodes. All giga2 nodes have thus been offlined - and the repairs will
commence as soon as giga2 has been drained of running jobs.
- 2004-10-13
-
Looks like we have been hit by a power failure again - a main circuit
which supplies power to 14 power panels in giga2 has switched off.
About 60 nodes are down due to this problem. We may have to drain
giga2 in order to repair the circuitry.
- 2004-10-10
-
There is a problem with some of the electrical installations. This
evening the cluster lost its network connection since the media
converter from fiber to ethernet had its power supply connected
to one of the power strips with a bad fuse - this has been corrected.
However nodes on switch01, switch02, switch03, switch06, and switch07
are affected by the power failure.
- 2004-09-23
-
We have upgraded the NFS server software on disk2 - we hope this will
solve the problem with the unstable NFS service. However, we still have
to reboot about 200 nodes to get rid of hanging processes.
- 2004-09-22
-
We have during the last 24 hours (or so) experienced
problems with NFS services from disk2. We are in the process
of determining the cause of the problem.
- 2004-09-18
-
IBM strongly recommends that we upgrade the BIOS of machines in giga2
to avoid some erratic behaviour like network interfaces disappearing,
sudden stalls and power-offs.
We have seen all these things happen and have therefore decided to
offline all giga2 machines Sunday morning so that any job will have
finished Tuesday morning. We can do the upgrade simply by rebooting the
machines so it should not be a whole-day operation.
- 2004-09-15
-
One of our key fileservers (disk2) ceased to serve NFS earlier
today - it has been rebooted, but many nodes are affected by hanging
processes and they must also be rebooted. We have identified the
affected nodes (approx. 780 in number). They have been offlined, and
will be rebooted and released back to the PBS queues as soon as
possible.
- 2004-09-06
-
Users meeting
The users meeting is Thursday 9/9 at 13.15 at the Kollokvierum at the
Department of Chemistry, SDU (same place as last time).
Agenda:
- The technical setup of the new 304 nodes ("giga2").
- Job scheduling and queue setup (workq, giga and giga2).
- Future for the 512 "workq" nodes.
- Misc.
Regarding point 3): The first 512 nodes were made operational 1/9-2002 and
the funding for running these will expire 1/9-2005. Without any active
measures, these computing resources will disappear by 1/9-2005, and the
research groups behind the original application will lose their access to
the system. There are three basic options:
- The 512 nodes disappear by 1/9-2005 and are not replaced.
- An application is made for funding for running the 512 nodes for
e.g another year (i.e. electric power and manpower).
- The groups behind the original proposal make individual or a common
application for new hardware according to their estimated
computational needs for the next 3 years.
Those pursuing options 2) and 3) should be aware of the 1/11-2004 application deadline to
DCSC, i.e. applications must be sent in by this date to ensure a continuing
supply of computational resources after 1/9-2005.
- 2004-09-03
-
Queue giga2 is now open.
LAM has been upgraded to version 7.0.6, and is now using the
GNU compilers (gcc, g77, c++) as default compilers. Please read
LAM-MPI.
- 2004-08-28
-
Horseshoe reopens
The cluster is now partially open for business - there are a couple of
more details on the new machines in the 'giga2' queue to deal with, but
'workq' and 'giga' are available (jobs can be submitted to the 'giga2'
queue, but will not start).
The web-pages have to be updated to accommodate the new 'giga2' queue,
especially the 'expanded queue info' will not be correct for a little
while. Other web-pages in the 'Docs' hierarchy also have to be updated and
a few new how-to's added.
Summary of new features:
-
All nodes and frontends have been upgraded to a new version of
Linux (Debian Sarge). This MAY break some of your applications,
because the fundamental system libraries have been upgraded.
However the upgrade was (very) necessary! The problem has to be
dealt with by recompiling the application.
-
PBS now supports more than 128 nodes per job - the limit has been
set to 256 nodes in 'workq' and 'giga2'.
-
giga2 (when open for business) has a 50 hour wall-clock limit as
is the case for 'giga'.
-
We are introducing a procedure for alleviating
switch-fragmentation
when scheduling multi-cpu jobs.
Due to the addition of nodes, the fairshare targets for the groups
have been adjusted to reflect the new allocations on the cluster.
The statistics for the last five fairshare windows have been removed
since they cannot be rescaled to accommodate the new number of nodes.
Please remember that there is a users meeting Thursday 9/9 at 13.15 at SDU
in the kollokvierummet at the Department of Chemistry, where the scheduling
strategy and other issues can be discussed.
- 2004-08-27
-
9.30PM - UPDATE on the cluster upgrade.
The installation of the nodes with an upgraded version of Linux
(Debian Sarge) has been completed. However before we can start
the PBS queueing system a few checks have to be performed.
As soon as the cluster is ready we'll send out an e-mail.
- 2004-08-13
-
The Horseshoe is being enlarged with 304 additional nodes. We have
now received all 304 PCs (3,2 GHz P4, 1 and 2 GB RAM, 80GB disk, and
Gb ethernet).
The time table for the merging of the old and new cluster is as follows:
25/8: All running jobs in giga, workq, and
express are purged, however queued jobs remain in the PBS database
ready for the reopening.
27/8: The Horseshoe will be open for business again, now with
an additional queue 'giga2'. All queued jobs are released
and will be executed according to their priorities as usual.
This implies that jobs submitted from now until 25/8 cannot
expect to complete unless they start with a walltime request
which allows them to finish before 25/8.
If you want a job submitted between now and 25/8 to remain
queued until after 27/8, you can force it to remain queued
by issuing the command:
'rsh infra /usr/pbs/bin/qhold -h u job-id'
These 'suspended' jobs can be located by using the 'qstat' command -
the jobs will be in state 'H'. A 'suspended' job can be released to
state 'Q' by using the command:
'rsh infra /usr/pbs/bin/qrls -h u job-id'
We'll use the new nodes and the time between now and until 25/8 to
evaluate new versions of our queuing software (PBSPro and MAUI/Moab),
test new versions of Linux, and tune a version of Linpack for a
benchmark run on all cluster nodes between 25/8 and 27/8 in order
to place the Horseshoe on the Top500 list of Supercomputer installations
(http://www.top500.org).
Once the enlarged Horseshoe reopens, the target %-shares for each user
group will be adjusted to reflect the added capacity and we'll announce
some new procedures when submitting multi-cpu jobs in order to alleviate
the phenomenon of "switch-fragmentation", i.e. the allocation of few
nodes on many switches for the job.
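To make "switch-fragmentation" concrete, the following minimal Python sketch counts how many switches a job's allocation spans. The node names, the 4-nodes-per-switch layout, and the function name are illustrative assumptions and do not describe the cluster's actual topology or the scheduler's procedure.

```python
from collections import Counter


def switch_spread(allocation, node_to_switch):
    """Count how many of a job's nodes sit on each switch."""
    return Counter(node_to_switch[node] for node in allocation)


if __name__ == "__main__":
    # Hypothetical layout: 4 nodes on each of three switches.
    node_to_switch = {f"n{i:02d}": f"switch{i // 4 + 1:02d}" for i in range(12)}

    compact = ["n00", "n01", "n02", "n03"]      # all nodes behind one switch
    fragmented = ["n00", "n04", "n05", "n08"]   # few nodes on many switches

    print(len(switch_spread(compact, node_to_switch)))     # 1
    print(len(switch_spread(fragmented, node_to_switch)))  # 3
```

Both allocations have four nodes, but the fragmented one forces the job's traffic across three switches, which is the situation the new procedures are meant to reduce.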
We would like to remind you of the user-meeting at SDU (kollokvierummet at
the Department of Chemistry) Thursday 9/9 at 13.15. We'll announce an
agenda at a later time.
- 2004-07-30
-
In the last 24 hours we have witnessed 2 network related events
affecting The Horseshoe:
- At about 8:40 in the morning yesterday we saw a tremendous
surge in network traffic on the internal cluster network,
which caused a kernel crash on about half the nodes - they
had to be rebooted. We are not sure what the cause of the
network problem was.
- At about 15:45 yesterday the SDU Campus Network Programmer
was forced to shut down all network traffic in and out of campus
due to attacks against the SDU Active Directory structure.
The network was reopened again at about 10:45 today.
- 2004-05-25
-
Infra (PBS and web server) has to have its motherboard replaced. The
server will be down for about 1/2 hour at some point before lunch. The
commands 'qsub', 'qstat', and other PBS related commands will not be
available. The webpages will also not be updated for that period
of time. Running jobs will continue for the duration of the shutdown.
- 2004-05-17
-
/wkspace1 is again available.
As previously announced we could not save the original data on
/wkspace1.
- 2004-04-21
-
Update on the situation with /wkspace1.
Unfortunately we have to accept that the data on the partition is
lost. We've tried for 5 days to recover the data - but during the
process several additional disks have crashed. We are in contact with
the vendor to replace the faulty disks ASAP. We'll bring
/wkspace1 online as soon as the disks have been replaced (but
without the old data).
- 2004-04-15
-
The RAID-Tower housing /wkspace1 experienced a disk-failure last evening.
The RAID-Tower responded by activating the 'hot spare' disk and
initiating a rebuild of the RAID5 array - however during the rebuild
errors on multiple other disks have been encountered.
We'll contact the vendor to obtain replacement disks. We are not sure
of the integrity of data on /wkspace1.
- 2004-03-06
-
The switch where all front ends are attached started continuously
rebooting at around 6:30 this morning making it impossible for users
to log onto the system to check jobs and submit new ones.
Another switch has been replaced.
- 2004-02-05
-
1 node jobs in queue giga.
We have observed that at times the machines in the
'giga' queue have been somewhat underutilized.
This is in part a consequence of the fact that the 'giga'
queue only has 140 machines, and we allow up to
128 nodes per job. The limit of 4 nodes per job
also inhibits an effective backfilling with small jobs.
As a test we have lifted the requirement of 4 nodes
per job in 'giga' to see if this will improve the
utilization. The 50 hour runtime limit remains.
- 2004-01-05
-
During the Christmas holidays many of you have experienced
very slow access to the filesystems attached to the server
'disk2'. The filesystems in question are '/home/disk2' and
'/wkspace2'.
This was caused by excessive usage of these filesystems by
the running jobs. It is important to realize that the
connection between nodes and the file-server is a resource
which has to be used sparingly.
Please note that all jobs are required to use the
local disk(s) on the node(s) allocated to the jobs, except
for initial and final transfer of data. The example PBS scripts
available on the website all show how to
do this.
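The stage-in, compute, stage-out pattern described above can be sketched as follows. This is a minimal Python illustration, not one of the example PBS scripts from the website; the function name run_on_local_scratch and the file names are hypothetical, and temporary directories stand in for the file server and the node-local disk.

```python
import shutil
import tempfile
from pathlib import Path


def run_on_local_scratch(input_file, result_dir, scratch_root, work):
    """Copy input to node-local scratch, run `work` there, copy the result back.

    Only the first and the last copy touch the file server; `work` receives
    the local input path and must return the path of the file it produced.
    """
    scratch = Path(tempfile.mkdtemp(dir=scratch_root))
    try:
        local_input = scratch / Path(input_file).name
        shutil.copy2(input_file, local_input)                 # initial transfer
        local_output = work(local_input)                      # all I/O stays local
        return Path(shutil.copy2(local_output, result_dir))   # final transfer
    finally:
        shutil.rmtree(scratch, ignore_errors=True)            # leave the node clean


if __name__ == "__main__":
    # Temporary directories stand in for the file server and the local disk.
    with tempfile.TemporaryDirectory() as home, \
            tempfile.TemporaryDirectory() as scratch:
        src = Path(home) / "input.dat"
        src.write_text("1 2 3\n")

        def work(local_input):
            out = local_input.with_name("output.dat")
            total = sum(int(x) for x in local_input.read_text().split())
            out.write_text(f"{total}\n")
            return out

        print(run_on_local_scratch(src, home, scratch, work).read_text())
```

The job's intermediate reads and writes never cross the network, so the connection between nodes and file server is used only twice per job.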
- 2003-11-19
-
Disk2 is up again, and the cluster
is ready for 'business'.
- 2003-11-19
-
Disk2 is down. Late yesterday afternoon there was an event
on the SCSI bus connecting our RAID-towers to disk2. We are
in the process of scanning the filesystems, a process which
should be done later this afternoon.
- 2003-11-18
-
As a followup on yesterday's news regarding
The Current Backfill Window, there is a new
"howto" describing a few tricks which might allow your queued jobs to start sooner:
Please read Job resource request management.
- 2003-11-17
-
The Current Backfill Window
The queue information web-page
now features a section on the current
backfill window, which
lists the resources (number of nodes and duration) available to run jobs
right now, without disturbing the start-time of
the currently highest prioritized (queued) job.
This information can be used to have jobs scheduled right away,
e.g. for test jobs or jobs which have a flexible need for number of nodes and/or
walltime consumption.
Please also read about the
scheduling strategy.
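As a rough illustration of how the backfill window can be used, the Python sketch below checks whether a resource request fits inside a window given as a number of nodes and a duration. The Window type, the function name, and the numbers are hypothetical; MAUI's real backfill logic is more elaborate, and this only mirrors the check a user can do by eye against the web-page.

```python
from dataclasses import dataclass


@dataclass
class Window:
    nodes: int     # nodes that are idle right now
    hours: float   # how long they stay free before the top-priority job needs them


def fits_backfill(window, nodes, walltime_hours):
    """True if the request fits the window in both node count and duration."""
    return nodes <= window.nodes and walltime_hours <= window.hours


if __name__ == "__main__":
    window = Window(nodes=12, hours=3.0)      # hypothetical current window
    print(fits_backfill(window, 8, 2.0))      # True: small and short enough
    print(fits_backfill(window, 8, 5.0))      # False: walltime too long
    print(fits_backfill(window, 16, 1.0))     # False: too many nodes
```

A job with a flexible node count or walltime can be trimmed until this check passes, which is what allows it to start right away.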
- 2003-10-08
-
/wkspace2 and 40 giga-nodes are available again.
We have finally completed the journey to recover from the
collapse of the RAID towers late July. As of this morning /wkspace2
and the 40 'giga' nodes which had emergency-backups of /wkspace2
are now again available for Scientific Computing.
Thank you for your patience during this arduous process.
- 2003-10-03
-
/home/disk1 and /wkspace1 are available again.
Disk1 is now "open" for business again: /home/disk1 and /wkspace1
are now available. For those of you having files on these
systems, please check that all is as expected.
Those who placed a (user) hold on jobs in the queue because the jobs
needed access to the above mentioned filesystems: Use the command
qrls to release the 'userhold' on these jobs:
rsh infra /usr/pbs/bin/qrls -h u job-id
Use the qstat command to locate these jobs, they will be
in state H.
- 2003-09-23
-
Unavailability of /home/disk1 and /wkspace1 26/9
until 3/10.
We are now at the next to last step in recovering after the terrible
crash of our RAID-Towers on disk2. We have to restructure the filesystems
on disk1. We have to take /home/disk1 and /wkspace1 offline for about 5
working days to build-in a new RAID-Tower and secure /home/disk1 on a
mirrored RAID-Tower setup.
These filesystems will be unavailable starting Friday 9/26 8AM.
disk1 should be fully operational the following Friday.
We ask that you do not submit jobs which ask for either
/home/disk1 or /wkspace1 until Friday 3/10.
If you have jobs in queue (*not* running) which require the use of disk1
resources: Please place a hold on those jobs - use qstat to
locate the
job-id's of these jobs - and then use qhold to place a
'userhold' on these jobs:
rsh infra /usr/pbs/bin/qhold -h u job-id
- 2003-09-03
-
The Horseshoe will reopen tomorrow September 4th at 2PM.
However, we are not quite finished with the process of restructuring
and recovering, thus the reopening tomorrow has the following caveats
attached:
- /wkspace2 will not be available for about two weeks.
- 40 'giga' nodes will be offline until /wkspace2 is restored (they are
used as buffer storage for the 1,4TB of data originating from /wkspace2).
- /home/disk1 and /wkspace1 have not been restructured.
These issues will be resolved during the next few weeks, as we receive
refurbished RAID towers from the vendor. More information will be posted
as the restructuring plan progresses.
- 2003-08-28
-
Approximately 1,2TB of data on /home/disk2 has
been recovered. Most of the directory structure also
seems to be intact. In other words, we feel confident to announce:
Most of /home/disk2 has been salvaged.
In the meantime we have taken delivery of two additional
RAID towers. These RAID towers have to be tested and installed
before access and normal operations resume on the cluster.
In addition the faulty hard-disks in the existing RAID towers
have to be exchanged. An ETA for the re-opening of
The Horseshoe will be posted very soon.
- 2003-08-22
-
The recovery efforts for /home/disk2 continue despite
some setbacks. We will post more about the situation after
the weekend.
We have issued a purchase order for two additional RAID towers,
which will be delivered next week. Using those new devices
/home/disk1 and /home/disk2 will be reconfigured
to be hosted on mirrored RAID towers, which will improve
the ability to survive disk and RAID tower failures.
- 2003-08-18
-
We are close to formulating a plan for:
- Recovery of data on /home/disk2.
- Resuming service on The Horseshoe.
The recovery operation is at the stage where we have a pretty good
overview of the amount of low-level information we are able to
extract from the RAID towers. How much of the filesystem we are able to
reconstruct, we do not know at this time. The vendor will let us use a
spare RAID tower for our attempts to rebuild the filesystem.
In order to return to normal operation on The Horseshoe, we need to
restructure the layout of filesystems on RAID towers. We are in the
process of negotiating with vendors regarding
delivery of additional RAID systems to improve our ability to survive
hardware failures.
Unfortunately it is too early to offer a realistic timeline
for the tasks outlined above. We will continually post information on
the website as work progresses.
- 2003-08-15
-
The RAID system containing /home/disk2 and /wkspace2
has developed into a nightmare. Apparently we have been supplied
with a system where all 32 disks were of very poor quality.
Since the RAID system just remaps data from bad blocks on one disk onto
another, these faults have gone unnoticed. On the 30th of July, however,
a run-away user process created ~500 GB of files, completely
filling up /home/disk2. This triggered a sequence of unrecoverable
errors. We initially thought that we could recover by switching
one disk at a time, and rebuilding the file system after each switch.
This was unsuccessful.
At this point we are faced with a total loss of all data on /home/disk2,
but data on /wkspace2 has been salvaged and stored on compute nodes, which
have been taken offline. We are in close contact with the supplier of the
RAID system, trying to recover at least some of the data, but hopes are slim.
Needless to say, this is a major disaster. We have no estimates
of when we will have a definitive answer for the status of /home/disk2,
or when users residing on /home/disk2 can return to using the machine.
- 2003-08-08
-
We are still hunting for "faulty" disks. A service
representative from the vendor company will come and
help us upgrade the firmware on the RAID towers.
- 2003-08-07
-
The Horseshoe is still down.
We are not at the end of the process of finding "faulty" disks
in the RAID towers.
fe1 will allow logins - but only users with homedirectories
on /home/disk1 can access files.
- 2003-08-05
-
The Horseshoe is still down. We do not yet know when we
return to normal service.
The vendor of the RAID towers has supplied us with a batch of
new hard disks. We are in the process of identifying "faulty"
disks and replacing them.
It's a tedious process, as we can only replace one disk at a time:
when a disk has been replaced, the RAID has to be rebuilt, a process
which can take up to 6 hours (we can then replace another).
It is not known at present how many disks we have to exchange.
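To give a feel for the time involved, here is a minimal sketch of the worst-case arithmetic, assuming every rebuild takes the full 6 hours and that rebuilds run strictly one after another. The function name is hypothetical. The figure of 32 disks is the total disk count of the affected RAID system mentioned in the 2003-08-15 entry, used here only as an upper bound.

```python
# Upper bound stated above: one rebuild can take up to 6 hours.
HOURS_PER_REBUILD = 6


def worst_case_hours(disks_to_replace):
    """Worst-case total rebuild time when disks are replaced one at a time."""
    return disks_to_replace * HOURS_PER_REBUILD


if __name__ == "__main__":
    for n in (1, 8, 32):
        hours = worst_case_hours(n)
        print(f"{n} disk(s): up to {hours} hours ({hours / 24:.1f} days)")
```

The estimate ignores the time needed to identify each faulty disk, so the real duration could be longer.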
- 2003-08-02
-
The hardware problems on the RAID systems attached to disk2 are
continuing this morning. We have determined that the problems
are of such a severe character that we have to cease our attempts
to restart the system.
We have contacted the manufacturer to inquire about the next appropriate
steps.
At this stage we do not expect any data loss, but due to
instabilities on multiple disks, we need to proceed carefully.
Users with home directories on /home/disk1 should be able to
continue using the cluster, as long as they do not require the use of /wkspace2.
- 2003-08-01
- We are experiencing severe RAID system problems where disks are
dropping out one after the other.
After one incident we had to reset the server, which crashed the
filesystem; it is being repaired as this is written.
We do not expect disk2 to be up until tomorrow morning,
if all goes well.
- 2003-06-23
- This part of Odense was blessed with a power failure.
The cluster nodes are not protected by any UPS, hence all nodes lost
power.
- 2003-05-27
-
A new queue has been created on the cluster. It's called express
and should be used for testing purposes only. The 4 frontends are
used as execution hosts for this queue. Jobs submitted to this
queue can only request 20 minutes of walltime.
Since the frontends are used by interactive users, the local /scratch
partition is not guaranteed to be as available as on the compute
nodes.
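As a minimal sketch, a job's walltime request can be checked against the queue limits before submission. The 20 minute limit for express is the one stated above; the 200 hour and 50 hour limits for workq and giga are taken from the 2003-05-07 entry. The function names are hypothetical and are not part of PBS or MAUI; the walltime string is assumed to use the usual PBS form [[HH:]MM:]SS.

```python
# Walltime limits per queue in seconds, as stated in the announcements.
LIMITS = {
    "workq": 200 * 3600,   # 200 hours
    "giga": 50 * 3600,     # 50 hours
    "express": 20 * 60,    # 20 minutes
}


def parse_walltime(text):
    """Convert a walltime string of the form [[HH:]MM:]SS to seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def within_limit(queue, walltime):
    """Return True if the requested walltime fits the limit of the queue."""
    return parse_walltime(walltime) <= LIMITS[queue]


if __name__ == "__main__":
    print(within_limit("express", "00:20:00"))
    print(within_limit("express", "00:30:00"))
    print(within_limit("workq", "150:00:00"))
```

A check like this only mirrors the published limits; the queue system itself remains the authority on what is accepted.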
- 2003-05-12
-
A new webpage
has been created, which presents a listing of queued
jobs based on priority, rather than the chronological
submission order shown by PBS's
qstat.
The priority based listing is used by
MAUI to make
scheduling decisions, i.e. the order in which jobs are
selected from the queues and allowed to run.
- 2003-05-07
-
As agreed at the users meeting on 1/5/2003 there are now limits on
the walltime a job can request:
-- 200 hours when submitted to workq
-- 50 hours when submitted to giga
Jobs running and queued as of noon 7/5/2003 will not be affected.
Jobs submitted to the queues after this point in time are subject
to removal. If a job is removed from the queue, the user will be
notified by email.
In case it is not possible to use the method outlined in
PBS multipart jobs
for running very long lasting jobs - please contact
Frank Jensen to request an exception
to the new queue policy.
- 2003-05-02
- We are experiencing severe problems with the queuing
system - bugs have been exposed in the PBS software.
Until the problem with PBS has been fixed there is a
cap of 128 nodes per job in place.
- 2003-04-07
- All of University of Southern Denmark in Odense lost main power at
around 21:02 tonight.
As the computing nodes are not covered by UPS (by decision) all nodes
have been restarted.
- 2003-04-07
- We are investigating strange problems with the MAUI scheduling system,
which currently does not populate the cluster with jobs in a
deterministic manner.
We hope to be on top of this as soon as possible.
- 2003-04-02
We went on-line again today.
Current users have received information on changes in the use of the cluster. In particular there are some constraints on the new part of the cluster that will be enforced by the scheduling system.
New users can apply for their account.
The Overview page contains information on the new cluster setup.
- 2003-03-28
We have severe stability problems with the disk systems on both disk servers.
There is no estimate on when we can be back online.
- 2003-03-24
We are working on bringing the system online this minute.
Unfortunately the cooling system contractors are still working on the fan coil in the room.
As Claus also needs to be certain that the new version of PBSpro works satisfactorily, we need some more time.
Please be patient while checking this page.
- 2003-03-20
Unfortunately we will not be able to start this Friday either.
We hope to have the old machines up and running this Friday but we need to tune the queue system.
The new machines still need to have Linux installed, but this can wait until after the old system is brought online again.
A cautious guess would be Monday afternoon, but we need to see our families too.
- 2003-03-16
The cooling system contractor finished their installation work Friday but did not have the time to start the compressor and to tune the new air-flow system.
Due to demands from the cooling system contractor we have not had a chance to install the hard disks and configure the BIOS of the new machines, which we hope to be able to do Wednesday.
We have had problems with the new disk systems (and partly still have), which is why we are only now, at this late stage, transferring user data.
We also planned to reinstall all the old machines but could not do so due to all the dust in the room which forced us to keep as much as possible off line.
Hence we are a long time from being able to start the cluster especially the extension part.
We expect to be able to let users start their jobs Friday 2003-03-21 at the latest. This might only be on the old system for a start.
This has taken much more time than we had expected mainly due to design changes in the cooling system installation after 6 days downtime but also due to the disk system problems which were solved this Friday.
- 2003-03-10
Changed rebuild schedule (until revised):
- 2003-03-10: The cooling equipment is not built yet. It is about 50% finished.
- 2003-03-17: New target date for restart (hopefully before then).
- 2003-02-25
Changed rebuild schedule (until revised):
- 2003-02-27: Machine will be shut down for work on the cooling system
- 2003-03-06: The machine will be brought up for benchmarking - this will take 1-2 days.
- 2003-03-10: This is the target for restarting the machine.
The main reason for the change of schedule is a wish from the company doing the installations needed for the extra cooling.
- 2003-02-07
The application form has been revised to cover the new grant holders.
Applicants with CPU year allocations are listed by name. All other grant holders should use Other.
- 2003-02-07
We are really low on remaining disk space. We have 60 GB left in the home directory store
and 23 GB in scratch space.
Until the rebuild you can use /mount/tmp to store data but be warned that /mount/tmp will be erased during rebuild at the end of the month so you have to secure your data from this data store yourself off Horseshoe before February 27th.
- 2003-02-07
Changed rebuild schedule (until revised):
- 2003-02-27: Machine will be shut down for work on the cooling system
- 2003-03-04: The machine will be brought up for benchmarking - this will take 1-2 days.
- 2003-03-06/07: This is the target for restarting the machine.
- 2003-01-28
Current rebuild schedule (until revised):
- 2003-02-27: Machine will be shut down for work on the cooling system
- 2003-03-04: This is the target for restarting the machine.
- 2003-01-27
We have now finalized the plans for extending the Horseshoe
cluster computer. The extension will consist of 140 nodes
(2.66 GHz Pentium-4, 1 GB DDR RAM, 120 GB disk) connected by Gigabit
Ethernet. The installation is expected to take place in week 9 (Feb 27-Mar 3)
and will require taking the machine down for at
least a couple of days. The exact schedule will depend on
when the nodes are delivered, arrival of the extra cooling
system, additional installation of power, etc.
In addition to the computing upgrade, the disk system for
permanent files will be extended by ~5 TB. We are aware that /home
is running 90+% full, but release of more disk space requires
taking the server offline, thereby interrupting all running jobs.
Since we will be forced to take the machine down within one month anyway,
we will postpone the disk repartition until week 9. If any users are
in desperate need of more disk space before that, contact
admin@dcsc.sdu.dk. Otherwise I urge all users to perform a little
disk cleaning, and remove unused files.
We plan a users meeting immediately after installation of
the additional nodes, hopefully early March, where details of the
installation and guidelines for using the new nodes will be discussed.
- 2002-11-01
Web site amended with Benchmarks and deep link for statistics.
- 2002-10-10
Cluster-Computing Course October 15 and 16.
Info on course is posted here: Course
- 2002-09-12
Today we will be connecting the Odense and Lyngby
installations. We will use the dark fiber parallel to the
Forskningsnettet to establish a 1 Gbps line connecting a total of 992
nodes together on one single LAN.
While the connection is present we will perform a LINPACK benchmark
on both clusters combined.
During the benchmark the PBS queue will be suspended and running
jobs paused. They will be resumed after the benchmark.
- 2002-08-30
Today we will shut down the cluster for the last
tweaking
- 2002-07-10
Another article has been spotted: Hardware-test.dk.
Furthermore there is an article in the paper version of IT2U from Dansk Metal.
- 2002-07-04
Due to further delays in the delivery of the last machines and the server, we do not expect to be able to start running grant jobs before August 1st. Let us hope that things change.
- 2002-07-03
Article in Computer World and Fyens Stiftstidende (login required)
- 2002-07-02
- Grand Press Day
Reporters and camera teams have been visiting us today. Keep an eye on TV2/Fyn (19:30), TV2 and DR1 (18:30) tonight.
- 2002-07-01
Compaq is not particularly clear in their information on arrival of machines.
Until now we have received 120 machines and we expect a further 260 machines today, even though all machines should have been here Thursday last week (according to Compaq's own information).
Compaq started the build of the last 140 machines Friday last week.
- 2002-06-21
You can now start requesting your accounts using the account application form
- 2002-06-20
Due to a postponed delivery of the machines we now anticipate a start date of July 15
- 2002-06-14
We expect to be able to start the whole system around July 1.