[PATCH V2 0/3] drivers/staging: zcache: dynamic page cache/swap compression

February 06th, 2011 - 11:20 pm ET by Dan Magenheimer

(Historical note: This "new" zcache patchset supersedes both the
kztmem patchset and the "old" zcache patchset as described in:
http://lkml.org/lkml/2011/2/5/148)

HIGH LEVEL OVERVIEW

Zcache doubles RAM efficiency while providing a significant
performance boost on many workloads.

Summary for kernel DEVELOPERS: Zcache uses lzo1x compression to increase
RAM efficiency for both page cache and swap resulting in a significant
performance increase (3-4% or more) on memory-pressured workloads
due to a large reduction in disk I/O. To do this, zcache uses an
in-kernel (no virtualization required) implementation of transcendent
memory ("tmem"), which has other proven uses and intriguing future uses
as well.

Summary for kernel MAINTAINERS: Zcache is a fully-functional,
in-kernel (non-virtualization) implementation of transcendent memory
("tmem"), providing an in-kernel user for cleancache and frontswap.
The patch is based on 2.6.37 and requires either the cleancache patch
or the frontswap patch or both. The patch is proposed as a staging
driver to obtain broader exposure for further evolution and,
GregKH-willing, is merge-able at the next opportunity. Zcache will
hopefully also, Linus-and-akpm-willing, remove the barrier to merge-ability
for cleancache and frontswap. Please note that there is a dependency
on xvmalloc.[ch], currently in drivers/staging/zram.

For kernel USERS seeking a new toy: Want to try it out? A complete
monolithic patch for 2.6.37 including zcache, cleancache, and frontswap
can be downloaded at:
http://oss.oracle.com/projects/tmem...0205.patch
(IMPORTANT NOTE: zcache MUST be specified as a kernel boot parameter or
nothing happens!) And if you love to see tons of detailed statistics
changing dynamically try running the following bash script in a big window
with "watch -d":
http://oss.oracle.com/projects/tmem...ache-stats

VERSION HISTORY

Version 2 is a bit more restrictive of concurrency (disabling irqs
in gets and flushes) and fixes a build problem reported by gregkh.

Version 1 changed considerably from V0 thanks to some excellent feedback
from Jeremy Fitzhardinge.

Feedback from others would be greatly appreciated. See "SPECIFIC REQUESTED
AREAS FOR ADVICE/FEEDBACK" below.

"ACADEMIC" OVERVIEW

The objective of all of this code (including previously posted
cleancache and frontswap patches) is to provide a mechanism
by which the kernel can store a potentially huge amount of
certain kinds of page-oriented data so that it (the kernel)
can be more flexible, dynamic, and/or efficient in the amount
of directly-addressable RAM that it uses with little or no loss
of performance and, on some workloads and configurations, even a
substantial increase in performance.

The data store for this page-oriented data, called "page-
addressable memory", or "PAM", is assumed to be
cheaper, slower, more plentiful, and/or more idiosyncratic
than RAM, but faster, more-expensive, and/or scarcer than disk.
Data in this store is page-addressable only, not byte-addressable,
which increases flexibility for the methods by which the
data can be stored, for example allowing for compression and
efficient deduplication. Further, the number of pages that
can be stored is entirely dynamic, which allows for multiple
independent data sources to share PAM resources effectively
and securely.
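
As a rough illustration of why page-only addressing helps, here is a
minimal Python sketch of a deduplicating page store. It is not part of
this patchset: the class name ToyDedupStore, the string handles, and the
use of SHA-256 as the content key are all assumptions made for the
example. Because callers can only put, get, or flush whole pages by
handle, the store is free to keep a single copy of identical pages and
reference-count it.

```python
import hashlib


class ToyDedupStore:
    """Hypothetical page store that keeps one copy of identical pages."""

    def __init__(self):
        self.blobs = {}  # content digest -> [page bytes, reference count]
        self.index = {}  # caller's handle -> content digest

    def put(self, handle, page):
        """Store a page under a handle, sharing storage with identical pages."""
        self.flush(handle)  # an overwrite first drops the old reference
        digest = hashlib.sha256(page).digest()
        entry = self.blobs.setdefault(digest, [page, 0])
        entry[1] += 1
        self.index[handle] = digest

    def get(self, handle):
        """Return the page stored under a handle, or None if there is none."""
        digest = self.index.get(handle)
        return None if digest is None else self.blobs[digest][0]

    def flush(self, handle):
        """Forget a handle; free the page once nothing references it."""
        digest = self.index.pop(handle, None)
        if digest is None:
            return
        entry = self.blobs[digest]
        entry[1] -= 1
        if entry[1] == 0:
            del self.blobs[digest]

    def unique_pages(self):
        return len(self.blobs)
```

A byte-addressable store could not do this safely, since any caller
could modify part of a shared page in place.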

Cleancache and frontswap are data sources for two types of this
page-oriented data: "ephemeral pages" such as clean page cache
pages that can be recovered elsewhere if necessary (e.g. from
disk); and "persistent" pages which are dirty pages that need
a short-term home to survive a brief RAM utilization spike but
need not be permanently saved to survive a reboot (e.g. swap).
The data source "puts" and "gets" pages and is also responsible
for directing coherency, via explicit "flushes" of pages and
related-groups of pages called "objects".
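
The put/get/flush contract described above can be sketched in a few
lines of Python. This is a toy model, not the tmem ABI: ToyTmemPool,
max_pages, and the oldest-first eviction are hypothetical, and zlib
merely stands in for whatever the data store does with the page. The
sketch assumes that a put to a persistent pool may be refused when the
pool is full, while an ephemeral pool may silently drop pages, so an
ephemeral get can miss even without a flush.

```python
import zlib

PAGE_SIZE = 4096


class ToyTmemPool:
    """Hypothetical stand-in for one tmem pool (not the kernel interface)."""

    def __init__(self, persistent, max_pages):
        self.persistent = persistent
        self.max_pages = max_pages
        self.pages = {}  # (object_id, index) -> compressed bytes, oldest first

    def put(self, object_id, index, data):
        """Copy one page into the pool; return False if the put is refused."""
        if len(data) != PAGE_SIZE:
            raise ValueError("tmem stores whole pages only")
        key = (object_id, index)
        if key not in self.pages and len(self.pages) >= self.max_pages:
            if self.persistent:
                return False  # assumed: the caller falls back to real swap
            # Ephemeral pool: drop the oldest page to make room.
            del self.pages[next(iter(self.pages))]
        self.pages.pop(key, None)  # re-insert so this key becomes the newest
        self.pages[key] = zlib.compress(data)
        return True

    def get(self, object_id, index):
        """Return the page, or None if it was evicted or flushed."""
        blob = self.pages.get((object_id, index))
        return None if blob is None else zlib.decompress(blob)

    def flush_page(self, object_id, index):
        """Invalidate a single page."""
        self.pages.pop((object_id, index), None)

    def flush_object(self, object_id):
        """Invalidate every page belonging to one object."""
        for key in [k for k in self.pages if k[0] == object_id]:
            del self.pages[key]
```

In this model a cleancache-like source would use a pool created with
persistent=False and treat a None from get as "read it from disk", while
a frontswap-like source would use persistent=True and treat a False from
put as "write it to the swap device instead".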

Transcendent memory, or "tmem", is a clean API/ABI that provides
for an efficient address translation layer and a set of highly
concurrent access methods to copy data between the data source
and the PAM data store. The first tmem implementation is in Xen.
This second tmem implementation is in-kernel (no virtualization
required) but is designed to be easily extensible for KVM or
possibly for cgroups.

A PAM data store must be fast enough to be accessed synchronously
since, when a put/get/flush is invoked by a data source, the
data transfer or invalidation is assumed to be completed on return.
The first PAM is implemented as a secure pool of Xen hypervisor memory
to allow highly-dynamic memory load balancing between guests.
This second PAM implementation uses in-kernel compression to roughly
halve RAM requirements for some workloads. Future proposed PAM
possibilities include: fast NVRAM, memory blades, far-far NUMA.
The clean layering provided here should simplify the implementation
of these future PAM data stores for Linux.

THIS PATCHSET

(NOTE: use requires cleancache and/or frontswap patches!)

This patchset provides an in-kernel implementation of transcendent
memory ("tmem") [1] and a PAM implementation where pages are compressed
and kept in kernel space (i.e. no virtualization, neither Xen nor KVM,
is required).

This patch is fully functional, but will benefit from some tuning and
some "policy" implementation. It demonstrates an in-kernel user for
the cleancache and frontswap patches [2,3] and, in many ways,
supplements/replaces the zram and "old" zcache patches [4,5] with a
more dynamic mechanism. Though some or all of this code may eventually
belong in mm or lib, this patch places it with staging drivers
so it can obtain exposure as its usage evolves.

The in-kernel transcendent memory implementation (see tmem.c)
conforms to the same ABI as the Xen tmem shim [6] but also provides
a generic interface to be used by one or more page-addressable
memory ("PAM") [7] implementations. This generic tmem code is
also designed to support multiple "clients", so should be easily
adaptable for KVM or possibly cgroups, allowing multiple guests
to more efficiently "timeshare" physical memory.

Zcache (see zcache.c) provides both "host" services (setup and
core memory allocation) for a single client for the generic tmem
code plus two different PAM implementations:

A. "compression buddies" ("zbud") which mates compression with a
shrinker interface to store ephemeral pages so they can be
easily reclaimed; compressed pages are paired and stored in
a physical page, resulting in higher internal fragmentation
B. a shim to xvMalloc [8] which is more space-efficient but
less receptive to page reclamation, so is fine for persistent
pages

Both of these use lzo1x compression (see linux/lib/lzo/*).
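
To make the zbud idea concrete, the following Python sketch pairs
compressed pages into page-sized frames with a first-fit search. It is
only an illustration under stated assumptions: ToyZbud and its methods
are hypothetical names, zlib stands in for lzo1x (which is not in the
Python standard library), and the real zbud code in zcache.c may choose
buddies differently.

```python
import zlib

PAGE_SIZE = 4096


class ToyZbud:
    """Hypothetical model of compression buddies: at most two blobs per frame."""

    def __init__(self):
        self.frames = []  # each frame is a list of one or two compressed blobs

    def store(self, page):
        """Compress a page; return a (frame, slot) handle, or None if rejected."""
        blob = zlib.compress(page)  # zlib is a stand-in for lzo1x
        if len(blob) > PAGE_SIZE:
            return None  # incompressible page: storing it would save nothing
        # First fit: find a frame with a single buddy and room for this blob.
        for number, frame in enumerate(self.frames):
            if len(frame) == 1 and len(frame[0]) + len(blob) <= PAGE_SIZE:
                frame.append(blob)
                return (number, 1)
        self.frames.append([blob])
        return (len(self.frames) - 1, 0)

    def load(self, handle):
        """Decompress and return the page identified by a handle."""
        number, slot = handle
        return zlib.decompress(self.frames[number][slot])

    def frames_used(self):
        return len(self.frames)
```

Limiting each frame to two blobs wastes whatever space the pair leaves
unused, which is the internal fragmentation noted in item A, but it
keeps the bookkeeping simple: once both buddies are gone, the whole
frame can be handed back.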

IMHO, it should be relatively easy to plug in other PAM implementations,
such as: PRAM [9], disaggregated memory [10], or far-far NUMA.

References:
[1] http://oss.oracle.com/projects/tmem
[2] http://lkml.org/lkml/2010/9/3/383
[3] https://lkml.org/lkml/2010/9/22/337
[4] http://lkml.org/lkml/2010/8/9/226
[5] http://lkml.org/lkml/2010/7/16/161
[6] http://lkml.org/lkml/2010/9/3/405
[7] http://marc.info/?l=linux-mm&m7811271605009
[8] http://code.google.com/p/compcache/wiki/xvMalloc
[9] http://www.linuxsymposium.org/2010/...ntent_kty5
[10] http://www.eecs.umich.edu/~tnm/trev_test/dissertationsPDF/kevinL.pdf

SPECIFIC REQUESTED AREAS FOR ADVICE/FEEDBACK

1. Some debugging code and extensive sysfs entries have been left in
place for this patch so its activity can be easily monitored. We welcome
other developers to play with it.
2. Little policy is in place (yet) to limit zcache from eventually
absorbing all free memory for compressed frontswap pages or
(if the shrinker isn't "fast enough") compressed cleancache
pages. On some workloads and some memory sizes, this eventually
results in OOMs. (In my testing, the OOM'ing is not worse, just
different.) We'd appreciate feedback on or patches that try
out various policies.
3. We've studied the GFP flags but are still not fully clear on the best
combination to use with zcache memory allocation. In particular,
we think "timid" GFP choices result in a lower hit rate, while using
GFP_ATOMIC might be considered rude, but results in a higher hit
rate and may be fine for this usage. We'd appreciate guidance on this.
4. We think we have the irq/softirq/preemption code correct but we're
definitely not experts in this area, so review would be appreciated.
5. Cleancache works best when the "clean working set" is larger
than the active file cache, but smaller than the memory available
for cleancache store. This scenario can be difficult to duplicate
in a kernel with fixed RAM size. For best results, zcache may benefit
from tuning changes to file cache parameters.
6. Benchmarking: Theoretically, zcache should have a negligible
worst case performance loss and a substantial best case performance
gain. Older processors may show a bigger worst case hit. We'd
appreciate any help running workloads on different boxes to better
characterize worst case and best case performance.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Nitin Gupta <ngupta@vflare.org>

 drivers/staging/Kconfig         |    2
 drivers/staging/Makefile        |    1
 drivers/staging/zcache/Kconfig  |   13
 drivers/staging/zcache/Makefile |    1
 drivers/staging/zcache/tmem.c   |  710 +++++++++++++++++
 drivers/staging/zcache/tmem.h   |  195 ++++
 drivers/staging/zcache/zcache.c | 1657 ++++++++++++++++++++++++++++++++++++++++
 7 files changed, 2579 insertions(+)
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Replies

#1 Dan Magenheimer
February 08th, 2011 - 08:10 pm ET
(Historical note: This "new" zcache patchset supersedes both the
kztmem patchset and the "old" zcache patchset as described in:
http://lkml.org/lkml/2011/2/5/148)



(In order to move discussion from the old kztmem patchset to
the new zcache patchset, I am replying here to Matt's email
sent at: https://lkml.org/lkml/2011/2/4/199 )

From: Matt [mailto:]



Hi Matt --

Thanks for all the thoughtful work and questions! Sorry it
took me a few days to reply...

This finally makes Cleancache's functionality usable for desktop and
other small device (non-enterprise) users (especially regarding
frontswap) :)

2) feedback

WARNING: at kernel/softirq.c:159 local_bh_enable+0xba/0x110()



These should be gone in V2.

I also observed that it takes some time until volumes (which use
kztmem's ephemeral nodes) are unmounted - probably due to emptying
slub/slab taking longer - so this should be normal.



If "some time" becomes a problem, I have a design in my
head how to fix this. But I'll consider it lower priority
for now.

2.2) a user (32bit box) who's running a pretty similar kernel to mine
(details later) has had some assert_spinlocks thrown while



The specific sequence of asserts indicates a race, but I think
a harmless one. I haven't been able to reproduce it and stared
at various race possibilities for a couple of hours without
luck. (Aha! There it is! Oops, no that's not it. Repeat.)
Hopefully getting broader exposure to more experienced
kernel developers will help find/fix this one.

2.3) rsync-operations seemed to speed up quite noticeably to say the
least (significantly)
:
so job (2) could be cut by 1-2 minutes. Unmounting the drive/partition
:
So kztmem also seems to help where low latency needs to be met, e.g.
pro-audio.
:
So productivity is improved quite a lot.



Thanks for running some performance tests on a broader set of
test cases! The numbers look very nice!

Questions:
• What exactly is kztmem?
∘ is it a tmem similar functionality like provided in the project
"Xen's Transcendent Memory"
∘ and zmem is simply a "plugin" for memory compression support to tmem
? (is that what zcache does ?)
• so simplified (superficially without taking into account advantages
or certain unique characteristics) some equivalents:
∘ frontswap == ramzswap
∘ kztmem == zcache
∘ cleancache == is the "core", "mastermind" or "hypervisor" behind all
this, making frontswap and kztmem kind of "plugins" for it ?



This is best described in the "Academic Overview" section
of PATCH V2 0/3: https://lkml.org/lkml/2011/2/6/346
Cleancache and frontswap are "data sources" for page-oriented
data that can easily be stored in "transcendent memory"
(aka "tmem"). Once pages of data are accessible only via tmem,
lots of things can be done to the data, including compression,
deduplication, being sent to the hypervisor, etc.

So kztmem (or more accurately: cleancache) is open for adding more
functionality in the future ?



Very definitely... I'm working on another interesting use
model right now!

• What are advantages of kztmem compared to ramzswap ("compcache") &
zcache ? From what I understood - it's more dynamic in its nature
than compcache & zcache: they need to preallocate predetermined amount
of memory, several "ram-drives" would be needed for SMP-scalability
∘ whereas this (pre-allocated RAM and multiple "ram-drives" aren't
needed for kztmem, cleancache and frontswap since cleancache,
frontswap & kztmem are concurrency-safe and dynamic (according to
documentation) ?



Yes, that's a good overview of the differences.

• Coming back to usage of compcache - how about the problem of 60%
memory fragmentation (according to compcache/zcache wiki,
http://code.google.com/p/compcache/...gmentation) ?
Could the situation be improved with in-kernel "memory compaction" ?
I'm not a developer so I don't know exactly how lumpy reclaim/memory
compaction and xvmalloc would interact with each other



Nitin is the expert on compcache and xvmalloc, so I will leave
this question unanswered for now.

• According to the Documentation you posted "e.g. a ram-based FS such
as tmpfs should not enable cleancache" - so it's not using block i/o
layer ? what are the performance or other advantages of that approach
?



Correct, no block i/o layer involved. The block i/o layer is
optimized for disks (though it is slowly becoming adapted to
faster devices). The real "advantage" is that EVERY put/get
has immediate feedback and this is very important to making
things as dynamic as possible.

• Is there support for XFS or reiserfs - how difficult would it be to
add that ?



I'm not familiar with either, but most filesystems are easy to
add... I'm just not able to do the testing. If zcache moves
into upstream, other filesystem experts should be able to try
zcache easily on other filesystems.

• Very interesting would be: support for FUSE (taking into account zfs
and ntfs3g, etc.) - would that be possible ?



I don't know enough about those to feel comfortable answering,
but would be happy to consult if someone else wants to try it.

• Was there testing done on 32bit boxes ? How about alternative
architectures such as ARM, PPC, etc. ?
∘ I'm especially interested in ARM since surely a lot on the



Sadly, I haven't done any testing on 32-bit boxes. All the code
is designed to be entirely architecture-independent though I'm
sure a bug or three will be found on other architectures.

be / Is there a port of cleancache, kztmem and frontswap available for
2.6.32* kernels ? (most android devices are currently running those)



I've found porting cleancache and frontswap to other recent
Linux versions to be straightforward. And zcache is just a
staging driver so should also port easily.

• Considering UP boxes - is the usage even beneficial on those ?
∘ If not - why not (written in the documentation) - due to missing raw
CPU power ?



Should work fine on a UP box. The majority of the performance
advantage is "converting" disk seek wait time into CPU compress/
decompress time.
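
The trade described here can be put into a small back-of-the-envelope
model, sketched below in Python. The function names and the cost model
are assumptions made for illustration, not anything measured or taken
from the patch: every page pushed out of RAM is compressed once, and a
later read either hits the compressed copy (costing a decompress) or
misses and goes to disk.

```python
def cost_without_cache(t_disk):
    """Average cost of re-reading one evicted page when it always hits disk."""
    return t_disk


def cost_with_cache(hit_rate, t_disk, t_compress, t_decompress):
    """Average cost per evicted page: compress on put, then a hit or a disk read."""
    return t_compress + hit_rate * t_decompress + (1.0 - hit_rate) * t_disk


def break_even_hit_rate(t_disk, t_compress, t_decompress):
    """Hit rate at which both costs are equal (requires t_disk > t_decompress)."""
    return t_compress / (t_disk - t_decompress)
```

With purely illustrative numbers (8000 microseconds per disk read, 50 to
compress, 20 to decompress; these are not measurements), the break-even
hit rate is 50 / 7980, well under one percent. The model also assumes
the CPU would otherwise sit idle during the disk wait; if the CPUs are
already saturated, the compression time is no longer free.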

• How is the scaling ? In case of Multiprocessors - are the
operations/parallelism or concurrency, how it's called, realized
through "work queues" - (there have been lots of changes recently in
the kernel [2.6.37, 2.6.38]). ?



Good questions. The concurrency should be pretty good, but in
the current version, interrupts are disabled during compression,
which could lead to some problems in a more real-time load.
This design is fixable but will take some work.

• Are there higher latencies during high memory pressure or high CPU
load situations, e.g. where the latencies would even go down below
without usage of kztmem ?



Theoretically, if there is no disk wait time (e.g. CPUs are always
loaded even during disk reads) AND there is high disk demand,
zcache could cause a reduction in performance.

• The compression algorithm in use seems to be lzo. Are any additional
selectable compressions planned such as lzf, gzip - maybe even bzip2 ?
- Would they be selectable via Kconfig ?
∘ are these threaded / scaling with multiple processors - e.g. like pcrypt ?



Good ideas for future enhancements!

• "Exactly how much memory it provides is entirely dynamic and
random." - can maximum limits be set ? ("watermarks" ? - if that is
the correct term)
How efficient is the algorithm ? What is it based on ?



For cleancache pages, all can be reclaimed so no maximum needs
to be set as long as the kernel reclaim mechanism is working properly.
For frontswap pages, there is a maximum currently hardcoded,
but this could be changed to be handled through a /sys fs file.

• Can the operations be sped up even more using the splice() system call or
something similar (if existent) - if even applicable ?



Sorry, I don't know the answer to this.

• Are userland hooks planned ? e.g. for other virtualization solutions
such as KVM, qemu, etc.



We've thought of userland hooks, but haven't tried them yet.

KVM should be able to take advantage of zcache with a little effort.

• How about deduplication support for the ephemeral (filesystem) pools?
∘ in my (humble) opinion this might be really useful - since in the
future there will be more and more CPU power but due to available RAM
not growing as linear (or fast) as CPU's power this could be a kind of
compensation to gain more memory
∘ would that work with "Kernel Samepage Merging"?
∘ is KSM even similar to tmem's deduplication functionality (tmem -
which is used or planned for Xen)
Referring to http://marc.info/?l=linux-kernel&m9683713531791&w=2
slides 20 to 21 on the presentation deduplication would seem much more
efficient than KSM.



Deduplication support could be added.

Kztmem seems to be quite useful on memory constrained devices:



You have suggested several interesting possibilities!

If I've missed anything important, please let me know!

Thanks again!
Dan
#3 Nitin Gupta
February 08th, 2011 - 09:40 pm ET
On 02/08/2011 08:03 PM, Dan Magenheimer wrote:
(Historical note: This "new" zcache patchset supersedes both the
kztmem patchset and the "old" zcache patchset as described in:
http://lkml.org/lkml/2011/2/5/148)



(In order to move discussion from the old kztmem patchset to
the new zcache patchset, I am replying here to Matt's email
sent at: https://lkml.org/lkml/2011/2/4/199 )

From: Matt [mailto:]





<snip>


• Coming back to usage of compcache - how about the problem of 60%
memory fragmentation (according to compcache/zcache wiki,
http://code.google.com/p/compcache/...gmentation) ?
Could the situation be improved with in-kernel "memory compaction" ?
I'm not a developer so I don't know exactly how lumpy reclaim/memory
compaction and xvmalloc would interact with each other



Nitin is the expert on compcache and xvmalloc, so I will leave
this question unanswered for now.





I'm currently in the process of designing a new allocator that gives
predictable memory fragmentation guarantees (at the expense of extra CPU
cycles). I've not yet posted details anywhere but many of the ideas are
from the "Compact Fit" allocator:
http://www.usenix.org/event/usenix0...unas_html/

I'm not sure how much time it will take since I'm not yet done with some
of the design details, and then userspace implementation, testing,
profiling and finally kernel port. Add to that extra concurrency issues
when integrating with zcache!

Thanks,
Nitin
#4 Matt
February 13th, 2011 - 07:10 pm ET
On Wed, Feb 9, 2011 at 1:03 AM, Dan Magenheimer
wrote:
[snip]

If I've missed anything important, please let me know!

Thanks again!
Dan




Hi Dan,

thank you so much for answering my email in such detail !

I shall pick up on that mail in my next email to the mailing list :)


Currently I've got a problem with btrfs which seems to get triggered
by cleancache get-operations:


Feb 14 00:37:19 lupus kernel: [ 2831.297377] device fsid
354120c992a00761-5fa07d400126a895 devid 1 transid 7
/dev/mapper/portage
Feb 14 00:37:19 lupus kernel: [ 2831.297698] btrfs: enabling disk space caching
Feb 14 00:37:19 lupus kernel: [ 2831.297700] btrfs: force lzo compression
Feb 14 00:37:19 lupus kernel: [ 2831.315844] zcache: created ephemeral
tmem pool, id=3
Feb 14 00:39:20 lupus kernel: [ 2951.853188] BUG: unable to handle
kernel paging request at 0000000001400050
Feb 14 00:39:20 lupus kernel: [ 2951.853219] IP: [<ffffffff8133ef1b>]
btrfs_encode_fh+0x2b/0x120
Feb 14 00:39:20 lupus kernel: [ 2951.853242] PGD 0
Feb 14 00:39:20 lupus kernel: [ 2951.853251] Oops: 0000 [#1] PREEMPT SMP
Feb 14 00:39:20 lupus kernel: [ 2951.853275] last sysfs file:
/sys/devices/platform/coretemp.3/temp1_input
Feb 14 00:39:20 lupus kernel: [ 2951.853295] CPU 4
Feb 14 00:39:20 lupus kernel: [ 2951.853303] Modules linked in: radeon
ttm drm_kms_helper cfbcopyarea cfbimgblt cfbfillrect ipt_REJECT
ipt_LOG xt_limit xt_tcpudp xt_state nf_nat_irc nf_conntrack_irc
nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp
iptable_filter ipt_addrtype xt_DSCP xt_dscp xt_iprange ip_tables
ip6table_filter xt_NFQUEUE xt_owner xt_hashlimit xt_conntrack xt_mark
xt_multiport xt_connmark nf_conntrack xt_string ip6_tables x_tables
it87 hwmon_vid coretemp snd_seq_dummy snd_seq_oss snd_seq_midi_event
snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm
snd_timer snd soundcore i2c_i801 wmi e1000e shpchp snd_page_alloc
libphy e1000 scsi_wait_scan sl811_hcd ohci_hcd ssb usb_storage
ehci_hcd [last unloaded: tg3]
Feb 14 00:39:20 lupus kernel: [ 2951.853682]
Feb 14 00:39:20 lupus kernel: [ 2951.853690] Pid: 11394, comm:
btrfs-transacti Not tainted 2.6.37-plus_v16_zcache #4 FMP55/ipower
G3710
Feb 14 00:39:20 lupus kernel: [ 2951.853725] RIP:
0010:[<ffffffff8133ef1b>] [<ffffffff8133ef1b>]
btrfs_encode_fh+0x2b/0x120
Feb 14 00:39:20 lupus kernel: [ 2951.853751] RSP:
0018:ffff880129a11b00 EFLAGS: 00010246
Feb 14 00:39:20 lupus kernel: [ 2951.853767] RAX: 00000000000000ff
RBX: ffff88014a1ce628 RCX: 0000000000000000
Feb 14 00:39:20 lupus kernel: [ 2951.853788] RDX: ffff880129a11b3c
RSI: ffff880129a11b70 RDI: 0000000000000006
Feb 14 00:39:20 lupus kernel: [ 2951.853808] RBP: 0000000001400000
R08: ffffffff8133eef0 R09: ffff880129a11c68
Feb 14 00:39:20 lupus kernel: [ 2951.853829] R10: 0000000000000001
R11: 0000000000000001 R12: ffff88014a1ce780
Feb 14 00:39:20 lupus kernel: [ 2951.853849] R13: ffff88021fefc000
R14: ffff88021fef9000 R15: 0000000000000000
Feb 14 00:39:20 lupus kernel: [ 2951.853870] FS:
0000000000000000(0000) GS:ffff8800bf500000(0000)
knlGS:0000000000000000
Feb 14 00:39:20 lupus kernel: [ 2951.853894] CS: 0010 DS: 0000 ES:
0000 CR0: 000000008005003b
Feb 14 00:39:20 lupus kernel: [ 2951.853911] CR2: 0000000001400050
CR3: 0000000001c27000 CR4: 00000000000006e0
Feb 14 00:39:20 lupus kernel: [ 2951.853932] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
Feb 14 00:39:20 lupus kernel: [ 2951.853952] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Feb 14 00:39:20 lupus kernel: [ 2951.853973] Process btrfs-transacti
(pid: 11394, threadinfo ffff880129a10000, task ffff880202e4ac40)
Feb 14 00:39:20 lupus kernel: [ 2951.853999] Stack:
Feb 14 00:39:20 lupus kernel: [ 2951.854006] ffff880129a11b50
ffff880000000003 ffff88003c60a098 0000000000000003
Feb 14 00:39:20 lupus kernel: [ 2951.854035] ffffffffffffffff
ffffffff810e6aaa 0000000000000000 0000000602e4ac40
Feb 14 00:39:20 lupus kernel: [ 2951.854063] ffffffff8133e3f0
ffffffff810e6cee 0000000000001000 0000000000000000
Feb 14 00:39:20 lupus kernel: [ 2951.854092] Call Trace:
Feb 14 00:39:20 lupus kernel: [ 2951.854103] [<ffffffff810e6aaa>] ?
cleancache_get_key+0x4a/0x60
Feb 14 00:39:20 lupus kernel: [ 2951.854122] [<ffffffff8133e3f0>] ?
btrfs_wake_function+0x0/0x20
Feb 14 00:39:20 lupus kernel: [ 2951.854140] [<ffffffff810e6cee>] ?
__cleancache_flush_inode+0x3e/0x70
Feb 14 00:39:20 lupus kernel: [ 2951.854161] [<ffffffff810b34d2>] ?
truncate_inode_pages_range+0x42/0x440
Feb 14 00:39:20 lupus kernel: [ 2951.854182] [<ffffffff812f115e>] ?
btrfs_search_slot+0x89e/0xa00
Feb 14 00:39:20 lupus kernel: [ 2951.854201] [<ffffffff810c3a45>] ?
unmap_mapping_range+0xc5/0x2a0
Feb 14 00:39:20 lupus kernel: [ 2951.854220] [<ffffffff810b3930>] ?
truncate_pagecache+0x40/0x70
Feb 14 00:39:20 lupus kernel: [ 2951.854240] [<ffffffff813458b1>] ?
btrfs_truncate_free_space_cache+0x81/0xe0
Feb 14 00:39:20 lupus kernel: [ 2951.854261] [<ffffffff812fce15>] ?
btrfs_write_dirty_block_groups+0x245/0x500
Feb 14 00:39:20 lupus kernel: [ 2951.854283] [<ffffffff812fcb6a>] ?
btrfs_run_delayed_refs+0x1ba/0x220
Feb 14 00:39:20 lupus kernel: [ 2951.854304] [<ffffffff8130afff>] ?
commit_cowonly_roots+0xff/0x1d0
Feb 14 00:39:20 lupus kernel: [ 2951.854323] [<ffffffff8130c583>] ?
btrfs_commit_transaction+0x363/0x760
Feb 14 00:39:20 lupus kernel: [ 2951.854344] [<ffffffff81067ea0>] ?
autoremove_wake_function+0x0/0x30
Feb 14 00:39:20 lupus kernel: [ 2951.854364] [<ffffffff81305bc3>] ?
transaction_kthread+0x283/0x2a0
Feb 14 00:39:20 lupus kernel: [ 2951.854383] [<ffffffff81305940>] ?
transaction_kthread+0x0/0x2a0
Feb 14 00:39:20 lupus kernel: [ 2951.854401] [<ffffffff81305940>] ?
transaction_kthread+0x0/0x2a0
Feb 14 00:39:20 lupus kernel: [ 2951.854420] [<ffffffff81067a16>] ?
kthread+0x96/0xa0
Feb 14 00:39:20 lupus kernel: [ 2951.854437] [<ffffffff81003514>] ?
kernel_thread_helper+0x4/0x10
Feb 14 00:39:20 lupus kernel: [ 2951.854455] [<ffffffff81067980>] ?
kthread+0x0/0xa0
Feb 14 00:39:20 lupus kernel: [ 2951.854471] [<ffffffff81003510>] ?
kernel_thread_helper+0x0/0x10
Feb 14 00:39:20 lupus kernel: [ 2951.854488] Code: 55 b8 ff 00 00 00
53 48 89 fb 48 83 ec 18 48 8b 6f 10 8b 3a 83 ff 04 0f 86 d5 00 00 00
85 c9 0f 95 c1 83 ff 07 0f 86 d5 00 00 00 <48> 8b 45 50 bf 05 00 00 00
48 89 06 84 c9 48 8b 85 68 fe ff ff
Feb 14 00:39:20 lupus kernel: [ 2951.854742] RIP [<ffffffff8133ef1b>]
btrfs_encode_fh+0x2b/0x120
Feb 14 00:39:20 lupus kernel: [ 2951.854762] RSP <ffff880129a11b00>
Feb 14 00:39:20 lupus kernel: [ 2951.854773] CR2: 0000000001400050
Feb 14 00:39:20 lupus kernel: [ 2951.860906] [ end trace
f831c5ceeaa49287 ]
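
One way to read the register dump above: the byte sequence marked
`<48> 8b 45 50` in the Code: line is the x86-64 encoding of
`mov 0x50(%rbp),%rax`, so the faulting access is at RBP + 0x50. With
RBP = 0x1400000 from the dump, that gives 0x1400050, which matches CR2.
The small C sketch below only reproduces that arithmetic; the helper
name `effective_address` is made up for illustration, and the sketch
says nothing about why RBP held that value.

```c
#include <stdint.h>

/*
 * Effective address of a base + 8-bit displacement memory operand,
 * as used by "mov 0x50(%rbp),%rax". Hypothetical helper, illustration only.
 */
static uint64_t effective_address(uint64_t base, int8_t disp8)
{
    /* The displacement is sign-extended before being added to the base. */
    return base + (uint64_t)(int64_t)disp8;
}
```

Calling `effective_address(0x1400000, 0x50)` with the RBP value from the
dump yields the address reported in CR2.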

in my case I had compress-force with lzo and disk_cache enabled


another user of the kernel I'm currently running has had the same
problem with zcache
(http://forums.gentoo.org/viewtopic-...ml#6571799)

(looks like in his case compression and any other fancy additional
features weren't enabled)


changes made by this kernel or patchset to btrfs are from
* io-less dirty throttling patchset (44 patches)
* zcache V2 ("[PATCH] staging: zcache: fix memory leak" should be
applied in both cases)
* [PATCH] fix (latent?) memory corruption in btrfs_encode_fh()
* btrfs-unstable changes to state of
3a90983dbdcb2f4f48c0d771d8e5b4d88f27fae6 (so practically equals btrfs
from 2.6.38-rc4+)

I haven't tried downgrading to vanilla 2.6.37 with zcache only yet,
but I did upgrade btrfs to the latest state of the btrfs-unstable
repository (http://git.eu.kernel.org/?p=linux/k...;a=summary),
namely 3a90983dbdcb2f4f48c0d771d8e5b4d88f27fae6.

This also didn't help and seemed to produce the same error message.

so to summarize:

1) error message appearing with all 4 patchsets that change btrfs code
applied, and with compress-force=lzo and disk_cache enabled

2) error message appearing with default mount options, btrfs from
2.6.37, and only the changes for zcache & the io-less dirty throttling
patchset applied (the first 2 patchsets from the list)


in my case I tried to extract / play back a 1.7 GiB tarball of my
portage-directory (lots of small files and some tar.bzip2 archives)
via pbzip2 or 7z when the error happened and the message was shown

Due to KMS, sound (webradio streaming) was still running, but I couldn't
continue working (X switched to kernel output), so I did the magic SysRq
combo (REISUB)


Does that BUG message ring a bell for anyone?

(if I should leave out anyone from the CC in the next emails or
future, please holler - I don't want to spam your inboxes)

Thanks

Matt
#5 Minchan Kim
February 15th, 2011 - 11:40 pm ET
On Wed, Feb 16, 2011 at 10:27 AM, Dan Magenheimer wrote:
From: Matt
Sent: Tuesday, February 15, 2011 5:12 PM
To: Minchan Kim
Cc: Dan Magenheimer; Chris Mason; Josef Bacik; Dan Rosenberg;
Yan Zheng; Li Zefan
Subject: Re: [PATCH V2 0/3] drivers/staging: zcache: dynamic page
cache/swap compression

On Mon, Feb 14, 2011 at 4:35 AM, Minchan Kim wrote:
> Just my guess. I might be wrong.
>
> __cleancache_flush_inode calls cleancache_get_key with a
> cleancache_filekey.
> cleancache_filekey's size is just 6 * u32.
> cleancache_get_key calls btrfs_encode_fh with the key,
> but btrfs_encode_fh typecasts the key to btrfs_fid, which is
> bigger than cleancache_filekey, so it should not access
> fields beyond cleancache_filekey.
>
> I think some file systems use an extended fid, so this problem can
> happen there. I don't know why we didn't find it earlier. Maybe Dan
> and others have tested it for a long time.
>
> Am I missing something?
>
> Kind regards,
> Minchan Kim
>
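
The size argument in the quoted message can be sketched in user-space C.
This is only an illustration: `struct key_like` stands in for the
6 * u32 key described above, and `struct fid_like` is a hypothetical
packed file-ID layout chosen for the example, not a copy of the kernel's
btrfs_fid. The `FID_WORDS_*` names and `fits_in_key` are likewise made
up. The `packed` attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

#define KEY_WORDS 6   /* the key is 6 * u32, per the message above */

/* Stand-in for the fixed-size key buffer handed to the encode hook. */
struct key_like {
    uint32_t fh[KEY_WORDS];
};

/* Hypothetical packed file-ID layout, for illustration only. */
struct fid_like {
    uint64_t objectid;
    uint64_t root_objectid;
    uint32_t gen;
    /* Fields below are only needed for the larger ID forms. */
    uint64_t parent_objectid;
    uint32_t parent_gen;
    uint64_t parent_root_objectid;
} __attribute__((packed));

/* Sizes of the three ID forms, counted in 32-bit words. */
enum {
    FID_WORDS_NON_CONNECTABLE = offsetof(struct fid_like, parent_objectid) / 4,
    FID_WORDS_CONNECTABLE     = offsetof(struct fid_like, parent_root_objectid) / 4,
    FID_WORDS_FULL            = sizeof(struct fid_like) / 4
};

/* Nonzero if an ID form of `words` 32-bit words fits in the key buffer. */
static int fits_in_key(size_t words)
{
    return words <= KEY_WORDS;
}
```

Under this assumed layout, only the smallest form fits in six words;
casting the key buffer to the full struct and touching the later fields
would read or write past the end of the key.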

Reposting Minchan's message to the btrfs mailing list for reference,
and also adding Li Zefan, Miao Xie, Yan Zheng, Dan Rosenberg and
Josef Bacik to CC.

Regards

Matt



Hi Matt and Minchan --

(BTRFS EXPERTS SEE *** BELOW)

I definitely see a bug in cleancache_get_key in the monolithic
zcache+cleancache+frontswap patch I posted on oss.oracle.com
that is corrected in linux-next but I don't see how it could
get provoked by btrfs.

The bug is that, in cleancache_get_key, the return value of fhfn should
be checked against 255.  If the return value is 255, cleancache_get_key
should return -1.  This should disable cleancache for any filesystem
where KEY_MAX is too large.

But cleancache_get_key always calls fhfn with connectable == 0, and
CLEANCACHE_KEY_MAX == 6 should be greater than BTRFS_FID_SIZE_CONNECTABLE
(which I think should be 5?).  So the code filling in the typecast
btrfs_fid should only be writing the first 5 32-bit words.
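
The check described above can be sketched in user-space C. This is a
simplified model under stated assumptions, not the kernel code: the
`encode_fh_fn` type, `struct file_key`, `get_key`, and the two mock
encoders are hypothetical names invented for the example. Only the
behaviour "treat a return value of 255 as failure and return -1" and the
key size of 6 words come from the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define KEY_MAX_WORDS 6     /* mirrors CLEANCACHE_KEY_MAX == 6 from the thread */
#define FH_TOO_SMALL  255   /* return value that must be treated as failure */

/* Hypothetical stand-in for a filesystem's encode hook (fhfn). */
typedef int (*encode_fh_fn)(uint32_t *fh, int *max_len, int connectable);

/* Hypothetical key buffer of KEY_MAX_WORDS 32-bit words. */
struct file_key {
    uint32_t fh[KEY_MAX_WORDS];
};

/* Shared mock logic: refuse with 255 if the buffer has fewer than `needed` words. */
static int mock_encode(uint32_t *fh, int *max_len, int needed)
{
    if (*max_len < needed) {
        *max_len = needed;          /* report how much space would be required */
        return FH_TOO_SMALL;
    }
    for (int i = 0; i < needed; i++)
        fh[i] = (uint32_t)(i + 1);  /* dummy payload */
    *max_len = needed;
    return 1;                       /* arbitrary "handle type" for the mock */
}

/* Mock encoder whose ID fits in the key (5 words). */
static int encode_small(uint32_t *fh, int *max_len, int connectable)
{
    (void)connectable;
    return mock_encode(fh, max_len, 5);
}

/* Mock encoder whose ID does not fit in the key (8 words). */
static int encode_large(uint32_t *fh, int *max_len, int connectable)
{
    (void)connectable;
    return mock_encode(fh, max_len, 8);
}

/* Returns 0 on success, -1 if the encoder reports the key buffer is too small. */
static int get_key(encode_fh_fn fhfn, struct file_key *key)
{
    int maxlen = KEY_MAX_WORDS;

    if (fhfn == NULL)
        return 0;                   /* no hook: nothing to encode in this model */
    if (fhfn(key->fh, &maxlen, 0) == FH_TOO_SMALL)
        return -1;                  /* caller should not use this key */
    return 0;
}
```

In this model, `get_key(encode_small, &k)` succeeds, while
`get_key(encode_large, &k)` returns -1, which is the point at which a
caller could skip caching for that filesystem.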



BTRFS_FID_SIZE_NON_CONNECTABLE is 5, not BTRFS_FID_SIZE_CONNECTABLE.
Anyway, you passed connectable as 0, so it should only be writing the
first 5 32-bit words, as you said.
That's one I missed. ;-)

Thanks.
Kind regards,
Minchan Kim